Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Overview

FeatureHasher

Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Note, this requires Jina>=2.2.4.

Example

Here I use FeatureHasher to hash each sentence of Pride and Prejudice into a 128-dim vector, and then use .match to find top-K similar sentences.

from jina import Document, DocumentArray, Flow

# load 
   
d = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').convert_uri_to_text()

# cut into non-empty sentences store in a DA
da = DocumentArray(Document(text=s.strip()) for s in d.text.split('\n') if s.strip())

# use FeatureHasher in a Flow
f = Flow().add(uses='jinahub://FeatureHasher')

embed_da = DocumentArray()
with f:
    f.post('/', da, on_done=lambda req: embed_da.extend(req.docs), show_progress=True)

print('self-matching...')
embed_da.match(embed_da, exclude_self=True, limit=5, normalization=(1, 0))
print('total sentences: ', len(embed_da))
for d in embed_da:
    print(d.text)
    for m in d.matches:
        print(m.scores['cosine'], m.text)
    input()
           [email protected][I]:πŸŽ‰ Flow is ready to use!
	πŸ”— Protocol: 		GRPC
	🏠 Local access:	0.0.0.0:52628
	πŸ”’ Private network:	192.168.178.31:52628
	🌐 Public address:	217.70.138.123:52628
β Ή       DONE ━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:01 100% ETA: 0 seconds 40 steps done in 1 second
total sentences:  12153
ο»ΏThe Project Gutenberg eBook of Pride and Prejudice, by Jane Austen

   
     *** END OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***

    
      *** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***

     
       production, promotion and distribution of Project Gutenberg-tm

      
        Pride and Prejudice

       
         By Jane Austen This eBook is for the use of anyone anywhere in the United States and 
        
          This eBook is for the use of anyone anywhere in the United States and 
         
           by the awkwardness of the application, and at length wholly 
          
            Elizabeth passed the chief of the night in her sister’s room, and 
           
             the happiest memories in the world. Nothing of the past was 
            
              charities and charitable donations in all 50 states of the United 
            
           
          
         
        
       
      
     
    
   

In practice, you can implement matching and storing via an indexer inside Flow. This example is only for demo purpose so any non-feature hashing related ops are implemented outside the Flow to avoid distraction.

Owner
Jina AI
A Neural Search Company. We provide the cloud-native neural search solution powered by state-of-the-art AI technology.
Jina AI
(D)arth (S)ide of the (L)og4j (F)orce, the ultimate log4j vulnerabilities assessor

DSLF DSLF stands for (D)arth (S)ide of the (L)og4j (F)orce. It is the ultimate log4j vulnerabilities assessor. It comes with four individual Python3 m

frontal 1 Jan 11, 2022
Simplify getting and using cookies from the browser to use in Python.

CookieCache Simplify getting and using cookies from the browser to use in Python. NOTE: All the logic to interface with the browsers is done by the Br

pat_h/to/file 2 May 06, 2022
A way to analyse how malware and/or goodware samples vary from each other using Shannon Entropy, Hausdorff Distance and Jaro-Winkler Distance

A way to analyse how malware and/or goodware samples vary from each other using Shannon Entropy, Hausdorff Distance and Jaro-Winkler Distance

11 Nov 15, 2022
This little tool is to calculate a MurmurHash value of a favicon to hunt phishing websites on the Shodan platform.

MurMurHash This little tool is to calculate a MurmurHash value of a favicon to hunt phishing websites on the Shodan platform. What is MurMurHash? Murm

Viral Maniar 87 Dec 31, 2022
Big-Papa Integrates Javascript and python for remote cookie stealing which then can be used for session hijacking

Big-Papa is a remote cookie stealer which can then be used for session hijacking and Bypassing 2 Factor Authentication

77 Jan 03, 2023
Salesforce Recon and Exploitation Toolkit

Salesforce Recon and Exploitation Toolkit Salesforce Recon and Exploitation Toolkit Usage python3 main.py URL References Announcement Blog - https:/

81 Dec 23, 2022
A local Socks5 server written in python, used for integrating Multi-hop

proxy-Zata proxy-Zata v1.0 This is a local Socks5 server written in python, used for integrating Multi-hop (Socks4/Socks5/HTTP) forward proxy then pro

4 Feb 24, 2022
Mert Güvençli 142 Jan 05, 2023
A tool used to obfuscate python scripts, bind obfuscated scripts to fixed machine or expire obfuscated scripts.

PyArmor Homepage (δΈ­ζ–‡η‰ˆη½‘η«™) Documentation(δΈ­ζ–‡η‰ˆ) PyArmor is a command line tool used to obfuscate python scripts, bind obfuscated scripts to fixed machine

Dashingsoft 1.9k Dec 30, 2022
the metasploit script(POC/EXP) about CVE-2021-22005 VMware vCenter Server contains an arbitrary file upload vulnerability

CVE-2021-22005-metasploit the metasploit script(POC/EXP) about CVE-2021-22005 VMware vCenter Server contains an arbitrary file upload vulnerability pr

Taroballz 25 Nov 15, 2022
PrivateRoom - Make your work private by building a system using arduino which instantly kills a program when someone enters your room/cabin

privateRoom Make your work private by building a system using arduino which instantly kills a program when someone enters your room/cabin STEPS: Uploa

Divyanshu Kumar 3 Nov 08, 2022
Moodle community-based vulnerability scanner

badmoodle Moodle community-based vulnerability scanner Description badmoodle is an unofficial community-based vulnerability scanner for moodle that sc

Michele Di Bonaventura 11 Dec 22, 2022
test application for the licence key web app.

licence_software_test_app Make sure you set your database values in a .env file to the folder. Install MYSQL connector: pip install mysql-connector-py

Carl Beattie 1 Oct 28, 2021
POC using subprocess lib in Python 🐍

POC subprocess ☞ POC using the subprocess library with Python. References: https://github.com/GuillaumeFalourd/poc-subprocess https://geekflare.com/le

Guillaume Falourd 2 Nov 28, 2022
Bypass ReCaptcha: A Python script for dealing with recaptcha

Bypass ReCaptcha Bypass ReCaptcha is a Python script for dealing with recaptcha.

Marcos Camargo 1 Jan 11, 2022
PyPasser is a Python library for bypassing reCaptchaV3 only by sending 2 requests.

PyPasser is a Python library for bypassing reCaptchaV3 only by sending 2 requests. In 1st request, gets token of captcha and in 2nd request,

253 Jan 05, 2023
Python library to prevent XSS(cross site scripting attach) by removing harmful content from data.

A tool for removing malicious content from input data before saving data into database. It takes input containing HTML with XSS scripts and returns va

2 Jul 05, 2022
Spring Cloud Gateway < 3.0.7 & < 3.1.1 Code Injection (RCE)

Spring Cloud Gateway 3.0.7 & 3.1.1 Code Injection (RCE) CVE: CVE-2022-22947 CVSS: 10.0 (Vmware - https://tanzu.vmware.com/security/cve-2022-22947)

Carlos Vieira 35 Dec 28, 2022
Malware for Discord, designed to steal passwords, tokens, and inject discord folders for long-term use.

Vital What is Vital? Vital is malware primarily used to collect and extract information from the Discord desktop client. While it has other features (

HellSec 59 Dec 01, 2022
CVE-2022-22963 PoC

CVE-2022-22963 CVE-2022-22963 PoC Slight modified for English translation and detection of https://github.com/chaosec2021/Spring-cloud-function-SpEL-R

Nicolas Krassas 104 Dec 08, 2022