Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Overview

FeatureHasher

Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Note, this requires Jina>=2.2.4.

Example

Here I use FeatureHasher to hash each sentence of Pride and Prejudice into a 128-dim vector, and then use .match to find top-K similar sentences.

from jina import Document, DocumentArray, Flow

# load 
   
d = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').convert_uri_to_text()

# cut into non-empty sentences store in a DA
da = DocumentArray(Document(text=s.strip()) for s in d.text.split('\n') if s.strip())

# use FeatureHasher in a Flow
f = Flow().add(uses='jinahub://FeatureHasher')

embed_da = DocumentArray()
with f:
    f.post('/', da, on_done=lambda req: embed_da.extend(req.docs), show_progress=True)

print('self-matching...')
embed_da.match(embed_da, exclude_self=True, limit=5, normalization=(1, 0))
print('total sentences: ', len(embed_da))
for d in embed_da:
    print(d.text)
    for m in d.matches:
        print(m.scores['cosine'], m.text)
    input()
           [email protected][I]:🎉 Flow is ready to use!
	🔗 Protocol: 		GRPC
	🏠 Local access:	0.0.0.0:52628
	🔒 Private network:	192.168.178.31:52628
	🌐 Public address:	217.70.138.123:52628
⠹       DONE ━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:01 100% ETA: 0 seconds 40 steps done in 1 second
total sentences:  12153
The Project Gutenberg eBook of Pride and Prejudice, by Jane Austen

   
     *** END OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***

    
      *** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***

     
       production, promotion and distribution of Project Gutenberg-tm

      
        Pride and Prejudice

       
         By Jane Austen This eBook is for the use of anyone anywhere in the United States and 
        
          This eBook is for the use of anyone anywhere in the United States and 
         
           by the awkwardness of the application, and at length wholly 
          
            Elizabeth passed the chief of the night in her sister’s room, and 
           
             the happiest memories in the world. Nothing of the past was 
            
              charities and charitable donations in all 50 states of the United 
            
           
          
         
        
       
      
     
    
   

In practice, you can implement matching and storing via an indexer inside Flow. This example is only for demo purpose so any non-feature hashing related ops are implemented outside the Flow to avoid distraction.

Owner
Jina AI
A Neural Search Company. We provide the cloud-native neural search solution powered by state-of-the-art AI technology.
Jina AI
Quickstart resources for the WiFi Nugget, a cat themed WiFi Security platform for beginners.

Quickstart resources for the WiFi Nugget, a cat themed WiFi Security platform for beginners.

HakCat 62 Jan 08, 2023
PoC of proxylogon chain SSRF(CVE-2021-26855) to write file by testanull, censored by github

CVE-2021-26855 PoC of proxylogon chain SSRF(CVE-2021-26855) to write file by testanull, censored by github Why does github remove this exploit because

The Hacker's Choice 58 Nov 15, 2022
An interactive python script that enables root access on the T-Mobile (Wingtech) TMOHS1, as well as providing several useful utilites to change the configuration of the device.

TMOHS1 Root Utility Description An interactive python script that enables root access on the T-Mobile (Wingtech) TMOHS1, as well as providing several

40 Dec 29, 2022
A tool used to obfuscate python scripts, bind obfuscated scripts to fixed machine or expire obfuscated scripts.

PyArmor Homepage (中文版网站) Documentation(中文版) PyArmor is a command line tool used to obfuscate python scripts, bind obfuscated scripts to fixed machine

Dashingsoft 1.9k Dec 30, 2022
FTP-Exploits is a tool made in python that contains 4 diffrent types of ftp exploits that can be used in Penetration Testing.

FTP-exploits FTP-exploits is a tool which is used for Penetration Testing that can run many kinds of exploits on port 21(FTP) Commands and Exploits Ex

1 Dec 26, 2021
Gmail Accounts Hacking

gmail-hack Gmail Accounts Hacking Gemail-Hack python script for Hack gmail account brute force What is brute force attack? In brute force attack,scrip

Aryan 25 Nov 10, 2022
Raphael is a vulnerability scanning tool based on Python3.

Raphael Raphael是一款基于Python3开发的插件式漏洞扫描工具。 Raphael is a vulnerability scanning too

b4zinga 5 Mar 21, 2022
Mass Check Vulnerable Log4j CVE-2021-44228

Log4j-CVE-2021-44228 Mass Check Vulnerable Log4j CVE-2021-44228 Introduction Actually I just checked via Vulnerable Application from https://github.co

Justakazh 6 Dec 28, 2022
Get important strings inside [Info.plist] & and Binary file also all output of result it will be saved in [app_binary].json , [app_plist_file].json file

Get important strings inside [Info.plist] & and Binary file also all output of result it will be saved in [app_binary].json , [app_plist_file].json file

12 Sep 28, 2022
WhPhisher: a Phishing tool With Python

WhPhisher Herramienta para hacer phishing con muchos métodos de túneling -----Como Instalarlo------- pkg install python3 pkg install git git clone htt

WhBeatZ 80 Jan 02, 2023
Tenssens framework focused on gathering information from free tools or resources. The intention is to help people find free OSINT resources.

Tenssens framework focused on gathering information from free tools or resources. The intention is to help people find free OSINT resources.

Md. Nur habib 31 Oct 21, 2022
S2-061 的payload,以及对应简单的PoC/Exp

S2-061 脚本皆根据vulhub的struts2-059/061漏洞测试环境来写的,不具普遍性,还望大佬多多指教 struts2-061-poc.py(可执行简单系统命令) 用法:python struts2-061-poc.py http://ip:port command 例子:python

dreamer 46 Oct 20, 2022
A Static Analysis Tool for Detecting Security Vulnerabilities in Python Web Applications

This project is no longer maintained March 2020 Update: Please go see the amazing Pysa tutorial that should get you up to speed finding security vulne

2.1k Dec 25, 2022
Suricata Language Server is an implementation of the Language Server Protocol for Suricata signatures

Suricata Language Server is an implementation of the Language Server Protocol for Suricata signatures. It adds syntax check, hints and auto-completion to your preferred editor once it is configured.

Stamus Networks 39 Nov 28, 2022
Subdomain enumeration,Web scraping and finding usernames automation script written in python

Subdomain enumeration,Web scraping and finding usernames automation script written in python

Syam 12 Nov 22, 2022
Phishing Campaign Toolkit

King Phisher Phishing Campaign Toolkit Installation For instructions on how to install, please see the INSTALL.md file. After installing, for instruct

RSM US LLP 1.9k Jan 01, 2023
A simple multi-threaded distributed SSH brute-forcing tool written in Python.

OrbitalDump A simple multi-threaded distributed SSH brute-forcing tool written in Python. How it Works When the script is executed without the --proxi

K4YT3X 408 Jan 03, 2023
Buff A simple BOF library I wrote under an hour to help me automate with BOF attack

What is Buff? A simple BOF library I wrote under an hour to help me automate with BOF attack. It comes with fuzzer and a generic method to generate ex

0x00 3 Nov 21, 2022
A repository to detect the ARP spoofing in any devices and prevent Man in the Middle(MITM) attack using Python3

arp_spoof_detector A repository to detect the ARP spoofing in any devices and prevent Man in the Middle(MITM) attack using Python3 Usage: git clone ht

Surya Das N 1 Oct 30, 2021
聚合Github上已有的Poc或者Exp,CVE信息来自CVE官网。Auto Collect Poc Or CVE from Github by CVE ID.

PocOrExp in Github 聚合Github上已有的Poc或者Exp,CVE信息来自CVE官网 注意:只通过通用的CVE号聚合,因此对于MS17-010等Windows编号漏洞以及著名的有绰号的漏洞,还是自己检索一下比较好 Usage python3 exp.py -h usage: ex

567 Dec 30, 2022