Product-Review-Summarizer - Created a product review summarizer which clustered thousands of product reviews and summarized them into a maximum of 500 characters, saving precious time of customers and helping them make a wise buying decision.

Overview

Product Reviews Summarizer

Version 1.0.0

A quick guide on installation of important libraries and running the code.

The project has three .ipynb files - Data Scraper.ipynb, cosine-similarity-wo-tf-idf.ipynb, and cosine-similarity-w-tf-idf.ipynb.


Data Scraper

For the Data Scraper python script, we need to import the following three libraries - requests, BeautifulSoup, and pandas. The installation process can be viewed by clicking on the respective library names.

Splash

In this project, instead of using the default web browser to scrape data, we have created a splash container using docker. Splash is a light-weight javascript rendering service with an HTTP API. For easy installation, you can watch this amazing video by John Watson Rooney on YouTube.

https://www.youtube.com/watch?v=8q2K41QC2nQ&t=361s

Note: You need to make sure that you give the Splash Localhost URL to the requests.get().

Running the code

After you have installed and configured everything, you can run the code by providing the URL of your choice. Suppose, you are taking a product from Amazon, make sure to go to All Reviews page and go to page #2. Copy this URL upto the last '=' and paste it as an f-string in the code. Add a '{x}' after the '='. The code is ready to run. It will scrape the product name, review title, star rating, and the review body from each page, until the last page is encountered, and save it in .xlsx format.

Note: Specify the required output name and destination.


cosine-similarity-wo-tf-idf

For the cosine similarity model, first we need to download the pretrained GloVe Word Embeddings. Run the Load GloVe Word Embeddings section in the script once. It is only required if the kernel is restarted.

For this script, we need to import the following libraries - numpy, pandas, nltk, nltk.tokenize, nltk.corpus, re, sklearn.metrics.pairwise, networkx, transformers, and time. Also run the nltk.download('punkt') and nltk.download('stopwords') lines to download them.

Next step is to load the data as a dataframe. Make sure to give the correct address. Pre-processing of the reviews is done for efficient results. The pre-processing steps include converting to string datatype, converting alphabetical characters to lowercase, removing stopwords, replacing non-alphabetical characters with blank character and tokenizing the sentences.

The pre-processed data is then grouped based on star ratings and sent to the cosine similarity and pagerank algorithm. The top 10 ranked sentences after the applying the pagerank algorithm are sent to huggingface transformers to create an extractive summary (min_lenght = 75, max_length = 300). The summary, along with the product name, star rating, no of reviews, % of total reviews, and the top 5 frequent words along with the count are saved in .xlsx format.

Note: Specify the required output name and destination.


cosine-similarity-w-tf-idf

For this model, along with the above libraries, we need to import the following additional libraries - spacy, and heapq. The cosine similarity algorithm has a time complexity of O(n^2). In order to have a fast execution, in this method, we are using tf-idf measure to score the frequent words, and hence the corresponding sentences. Only the top 1000 sentences are then sent to the cosine similarity algorithm. Usage of the tf-idf measure, ensures that each product, irrespective of the number of sentences in the reviews, gives an output within 120 seconds. This method makes sure no important feature is lost, giving similar results as the previous method but in considerately less time.


Contributors

© Parv Bhatt © Namratha Sri Mateti © Dominic Thomas


Owner
Parv Bhatt
Masters in Data Analytics Student at Penn State University
Parv Bhatt
The Easy-to-use Dialogue Response Selection Toolkit for Researchers

The Easy-to-use Dialogue Response Selection Toolkit for Researchers

GMFTBY 32 Nov 13, 2022
A PyTorch implementation of paper "Learning Shared Semantic Space for Speech-to-Text Translation", ACL (Findings) 2021

Chimera: Learning Shared Semantic Space for Speech-to-Text Translation This is a Pytorch implementation for the "Chimera" paper Learning Shared Semant

Chi Han 43 Dec 28, 2022
Beyond Paragraphs: NLP for Long Sequences

Beyond Paragraphs: NLP for Long Sequences

AI2 338 Dec 02, 2022
Implementation of the Hybrid Perception Block and Dual-Pruned Self-Attention block from the ITTR paper for Image to Image Translation using Transformers

ITTR - Pytorch Implementation of the Hybrid Perception Block (HPB) and Dual-Pruned Self-Attention (DPSA) block from the ITTR paper for Image to Image

Phil Wang 17 Dec 23, 2022
Coreference resolution for English, French, German and Polish, optimised for limited training data and easily extensible for further languages

Coreferee Author: Richard Paul Hudson, Explosion AI 1. Introduction 1.1 The basic idea 1.2 Getting started 1.2.1 English 1.2.2 French 1.2.3 German 1.2

Explosion 70 Dec 12, 2022
NAACL 2022: MCSE: Multimodal Contrastive Learning of Sentence Embeddings

MCSE: Multimodal Contrastive Learning of Sentence Embeddings This repository contains code and pre-trained models for our NAACL-2022 paper MCSE: Multi

Saarland University Spoken Language Systems Group 39 Nov 15, 2022
American Sign Language (ASL) to Text Converter

Signterpreter American Sign Language (ASL) to Text Converter Recommendations Although there is grayscale and gaussian blur, we recommend that you use

0 Feb 20, 2022
Code for "Generative adversarial networks for reconstructing natural images from brain activity".

Reconstruct handwritten characters from brains using GANs Example code for the paper "Generative adversarial networks for reconstructing natural image

K. Seeliger 2 May 17, 2022
Python-zhuyin - An open source Python library that provides a unified interface for converting between Chinese pinyin and Zhuyin (bopomofo)

Python-zhuyin - An open source Python library that provides a unified interface for converting between Chinese pinyin and Zhuyin (bopomofo)

2 Dec 29, 2022
Tracking Progress in Natural Language Processing

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Sebastian Ruder 21.2k Dec 30, 2022
Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code.

textgenrnn Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code, or quickly tr

Max Woolf 4.8k Dec 30, 2022
Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources (NAACL-2021).

Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources Description This is the repository for the paper Unifying Cross-

Sapienza NLP group 16 Sep 09, 2022
Prompt tuning toolkit for GPT-2 and GPT-Neo

mkultra mkultra is a prompt tuning toolkit for GPT-2 and GPT-Neo. Prompt tuning injects a string of 20-100 special tokens into the context in order to

61 Jan 01, 2023
DLO8012: Natural Language Processing & CSL804: Computational Lab - II

NATURAL-LANGUAGE-PROCESSING-AND-COMPUTATIONAL-LAB-II DLO8012: NLP & CSL804: CL-II [SEMESTER VIII] Syllabus NLP - Reference Books THE WALL MEGA SATISH

AMEY THAKUR 7 Apr 28, 2022
Sequence-to-Sequence learning using PyTorch

Seq2Seq in PyTorch This is a complete suite for training sequence-to-sequence models in PyTorch. It consists of several models and code to both train

Elad Hoffer 514 Nov 17, 2022
DeepSpeech - Easy-to-use Speech Toolkit including SOTA ASR pipeline, influential TTS with text frontend and End-to-End Speech Simultaneous Translation.

(简体中文|English) Quick Start | Documents | Models List PaddleSpeech is an open-source toolkit on PaddlePaddle platform for a variety of critical tasks i

5.6k Jan 03, 2023
Chinese NER(Named Entity Recognition) using BERT(Softmax, CRF, Span)

Chinese NER(Named Entity Recognition) using BERT(Softmax, CRF, Span)

Weitang Liu 1.6k Jan 03, 2023
Lattice methods in TensorFlow

TensorFlow Lattice TensorFlow Lattice is a library that implements constrained and interpretable lattice based models. It is an implementation of Mono

504 Dec 20, 2022
Natural Language Processing with transformers

we want to create a repo to illustrate usage of transformers in chinese

Datawhale 763 Dec 27, 2022
A desktop GUI providing an audio interface for GPT3.

Jabberwocky neil_degrasse_tyson_with_audio.mp4 Project Description This GUI provides an audio interface to GPT-3. My main goal was to provide a conven

16 Nov 27, 2022