Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Overview

FeatureHasher

Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Note, this requires Jina>=2.2.4.

Example

Here I use FeatureHasher to hash each sentence of Pride and Prejudice into a 128-dim vector, and then use .match to find top-K similar sentences.

from jina import Document, DocumentArray, Flow

# load 
   
d = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').convert_uri_to_text()

# cut into non-empty sentences store in a DA
da = DocumentArray(Document(text=s.strip()) for s in d.text.split('\n') if s.strip())

# use FeatureHasher in a Flow
f = Flow().add(uses='jinahub://FeatureHasher')

embed_da = DocumentArray()
with f:
    f.post('/', da, on_done=lambda req: embed_da.extend(req.docs), show_progress=True)

print('self-matching...')
embed_da.match(embed_da, exclude_self=True, limit=5, normalization=(1, 0))
print('total sentences: ', len(embed_da))
for d in embed_da:
    print(d.text)
    for m in d.matches:
        print(m.scores['cosine'], m.text)
    input()
           [email protected][I]:πŸŽ‰ Flow is ready to use!
	πŸ”— Protocol: 		GRPC
	🏠 Local access:	0.0.0.0:52628
	πŸ”’ Private network:	192.168.178.31:52628
	🌐 Public address:	217.70.138.123:52628
β Ή       DONE ━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:01 100% ETA: 0 seconds 40 steps done in 1 second
total sentences:  12153
ο»ΏThe Project Gutenberg eBook of Pride and Prejudice, by Jane Austen

   
     *** END OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***

    
      *** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***

     
       production, promotion and distribution of Project Gutenberg-tm

      
        Pride and Prejudice

       
         By Jane Austen This eBook is for the use of anyone anywhere in the United States and 
        
          This eBook is for the use of anyone anywhere in the United States and 
         
           by the awkwardness of the application, and at length wholly 
          
            Elizabeth passed the chief of the night in her sister’s room, and 
           
             the happiest memories in the world. Nothing of the past was 
            
              charities and charitable donations in all 50 states of the United 
            
           
          
         
        
       
      
     
    
   

In practice, you can implement matching and storing via an indexer inside Flow. This example is only for demo purpose so any non-feature hashing related ops are implemented outside the Flow to avoid distraction.

Owner
Jina AI
A Neural Search Company. We provide the cloud-native neural search solution powered by state-of-the-art AI technology.
Jina AI
Meterpreter Reverse shell over TOR network using hidden services

Poiana Reverse shell over TOR network using hidden services Features - Create a hidden service - Generate non-staged payload (python/meterpreter_rev

calfcrusher 80 Dec 21, 2022
Chrome Post-Exploitation is a client-server Chrome exploit to remotely allow an attacker access to Chrome passwords, downloads, history, and more.

ChromePE [Linux/Windows] Chrome Post-Exploitation is a client-server Chrome exploit to remotely allow an attacker access to Chrome passwords, download

Finn Lancaster 3 Oct 05, 2022
Flutter Reverse Engineering Framework

This framework helps reverse engineer Flutter apps using patched version of Flutter library which is already compiled and ready for app repacking. There are changes made to snapshot deserialization p

PT SWARM 910 Jan 01, 2023
Exploiting CVE-2021-44228 in Unifi Network Application for remote code execution and more

Log4jUnifi Exploiting CVE-2021-44228 in Unifi Network Application for remote cod

96 Jan 02, 2023
A simple subdomain scanner in python

Subdomain-Scanner A simple subdomain scanner in python ✨ Features scans subdomains of a domain thats it! πŸ’β€β™€οΈ How to use first download the scanner.p

Portgas D Ace 2 Jan 07, 2022
Proof of concept for CVE-2021-24086, a NULL dereference in tcpip.sys triggered remotely.

CVE-2021-24086 This is a proof of concept for CVE-2021-24086 ("Windows TCP/IP Denial of Service Vulnerability "), a NULL dereference in tcpip.sys patc

Axel Souchet 220 Dec 14, 2022
CVE-2022-23046 - SQL Injection Vulnerability on PhpIPAM v1.4.4

CVE-2022-23046 PhpIPAM v1.4.4 allows an authenticated admin user to inject SQL s

2 Feb 15, 2022
Whois-Python - Get Whois Domain with Python GUI

Whois-Python-GUI Get Whois Domain with Python - GUI :) WARNING Dont Copy ! - W

MR.D3F417 3 Feb 21, 2022
A Docker based LDAP RCE exploit demo for CVE-2021-44228 Log4Shell

log4j-poc An LDAP RCE exploit for CVE-2021-44228 Log4Shell Description This demo Tomcat 8 server has a vulnerable app deployed on it and is also vulne

60 Dec 10, 2022
Fat-Stealer is a stealer that allows you to grab the Discord token from a user and open a backdoor in his machine.

Fat-Stealer is a stealer that allows you to grab the Discord token from a user and open a backdoor in his machine.

Jet Berry's 21 Jan 01, 2023
Backdoor is a term that refers to the access of the software or hardware of a computer system without being detected.

This program is an non-object oriented opensource, hidden and undetectable backdoor/reverse shell/RAT for Windows made in Python 3 which contains many features such as multi-client support and cross-

35 Apr 17, 2022
IDA Python Script for anti ollvm

IDA Python Script for anti ollvm

Shocker 62 Dec 23, 2022
This is a simple tool to create ZIP payloads using a provided wordlist for the symlink attack (present in some file upload vulnerabilities)

zip-symlink-payload-creator This is a simple tool to create ZIP payloads using a provided wordlist for the symlink attack (present in some file upload

stark0de 6 Aug 18, 2022
πŸ’£ Bomb Crypto Bot πŸ’£

πŸ’£ Bomb Crypto Bot πŸ’£ ⚠️ Warning I am not responsible for any penalties incurred by those who use the bot, use it at your own risk. πŸ“„ Documentation -

Matheus Benites 4 Apr 27, 2022
Having a weak password is not good for a system that demands high confidentiality and security of user credentials

Having a weak password is not good for a system that demands high confidentiality and security of user credentials. It turns out that people find it difficult to make up a strong password that is str

PyLaboratory 0 Feb 07, 2022
INFO 3350/6350, Spring 2022, Cornell

Information Science 3350/6350 Text mining for history and literature Staff and sections Instructor: Matthew Wilkens Graduate TAs: Federica Bologna, Ro

Wilkens Teaching 6 Feb 21, 2022
Steal Files on a Windows Machine

File-Stealer Steal Files on a Windows Machine About This Script will steal certain Files on a Windows Machine and sends them to a FTP Server. Preview

Marcel 5 Nov 17, 2022
Proof of concept to check if hosts are vulnerable to CVE-2021-41773

CVE-2021-41773 PoC Proof of concept to check if hosts are vulnerable to CVE-2021-41773. Description (https://cve.mitre.org/cgi-bin/cvename.cgi?name=CV

Jordan Jay 43 Nov 09, 2022
A tool to brute force a gmail account. Use this tool to crack multiple accounts

A tool to brute force a gmail account. Use this tool to crack multiple accounts. This tool is developed to crack multiple accounts

Saad 12 Dec 30, 2022
Dahua IPC/VTH/VTO devices auth bypass exploit

CVE-2021-33044 Dahua IPC/VTH/VTO devices auth bypass exploit About: The identity authentication bypass vulnerability found in some Dahua products duri

Ashish Kunwar 23 Dec 02, 2022