A pure-python implementation of the UpSet suite of visualisation methods by Lex, Gehlenborg et al.

Last update: Jan 04, 2023

Overview

pyUpSet

Purpose

How to install

How it works

A note on the input format

Upcoming changes

Purpose

The purpose of this package is to statically reproduce some of the visualisations that can be obtained through the UpSet tool of Lex, Gehlenborg et al.

In particular, pyUpSet strengthens UpSet's focus on intersections, which motivates many of the design choices behind the exposed interface and the internal mechanics of the module. (More on this below.)

Consistent with the documentation of Lex et al.'s UpSet, the data used in the following examples comes from the GroupLens Labs movie data set.

How to install

pyUpSet is on PyPI and can therefore be installed via pip:

pip install pyupset

If you'd rather install from source, you can download and unzip the tar archive (in pyupset/dist/) and run

python setup.py install

How it works

The current interface is very simple: Plots can be generated solely from the exposed function plot, whose arguments allow flexible customisations of the graphs. The easiest example is the plain, straightforward basic intersection plot:

import pyupset as pyu
from pickle import load
with open('./test_data_dict.pckl', 'rb') as f:
    data_dict = load(f)
pyu.plot(data_dict)

to produce

N.B.: Notice that intersections are exclusive, meaning that they form a partition of the union of the base sets.
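To make the notion of exclusive intersections concrete, here is a minimal sketch in plain Python. It is not pyUpSet's implementation; the function name `exclusive_intersections` and the toy titles are assumptions made for illustration. An item is assigned to the combination of sets it belongs to and to no other, so the resulting groups partition the union.

```python
from itertools import combinations


def exclusive_intersections(sets_by_name):
    """Map each combination of set names to the items that belong to
    exactly those sets and to no others (empty groups are skipped)."""
    names = sorted(sets_by_name)
    result = {}
    for degree in range(1, len(names) + 1):
        for combo in combinations(names, degree):
            # Items present in every set of the combination...
            inside = set.intersection(*(sets_by_name[n] for n in combo))
            # ...minus anything that also appears in a set outside it.
            outside = set().union(
                *(sets_by_name[n] for n in names if n not in combo)
            )
            exclusive = inside - outside
            if exclusive:
                result[combo] = exclusive
    return result


# Toy data (hypothetical titles, not taken from the movie data set).
genres = {
    "action": {"A", "B", "C"},
    "adventure": {"B", "C", "D"},
    "romance": {"C", "E"},
}
parts = exclusive_intersections(genres)

# The group sizes add up to the size of the union because the groups
# do not overlap.
union = set().union(*genres.values())
total = sum(len(items) for items in parts.values())
```

In this toy case `total` equals `len(union)`, which is the partition property stated in the note above.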

Displayed intersections can also be filtered or sorted by size or degree:

pyu.plot(data_dict, unique_keys = ['title'], sort_by='degree', inters_size_bounds=(20, 400))

produces

The example above also uses the unique_keys kwarg, which specifies columns of the underlying data frames in data_dict that can be used to uniquely identify rows and possibly speed up the computation of intersections.

Intersection highlighting

pyUpSet supports "queries", i.e. the highlighting of intersections. Intersections to highlight are specified through tuples. For example, the following call produces graphs where all data is highlighted that corresponds to movies classified as both "adventure" and "action", or "romance" and "war".

pyu.plot(data_dict, unique_keys = ['title'], 
         additional_plots=[{'kind':'scatter', 'data_quantities':{'x':'views', 'y':'rating_std'}},
                           {'kind':'hist', 'data_quantities':{'x':'views'}}],
         query = [('adventure', 'action'), ('romance', 'war')]
        )

Additional plots

It is possible to add further plots that use information contained in the data frames, as in

pyu.plot(data_dict, unique_keys = ['title'], 
         additional_plots=[{'kind':'scatter', 'data_quantities':{'x':'views', 'y':'rating_std'}},
                           {'kind':'hist', 'data_quantities':{'x':'views'}}],
         query = [('adventure', 'action'), ('romance', 'war')])

This produces

The highlighting produced by the queries is passed to the additional graphs. The dictionary specifying the additional graphs can also take standard matplotlib arguments as kwargs:

pyu.plot(data_dict, unique_keys = ['title'], 
        additional_plots=[{'kind':'scatter', 
                           'data_quantities':{'x':'views', 'y':'rating_std'},
                           'graph_properties':{'alpha':.8, 'lw':.4, 'edgecolor':'w', 's':50}},
                          {'kind':'hist', 
                           'data_quantities':{'x':'views'},
                           'graph_properties':{'bins':50}}], 
        query = [('adventure', 'action'), ('romance', 'war')])

yields

A note on the input format

pyUpSet has a very specific use case: It is focussed on the study of intersections of sets. In order for a definition of intersection to make sense, and even more for the integration of additional graphs to be meaningful, it is assumed that the input data frames have properties of homonymy (they contain columns with the same names) and homogeneity (columns with the same name, intuitively, contain data of the same kind). While homonymy is a purely interface-dependent requirement whose aim is primarily to make pyUpSet's interface leaner, homogeneity has a functional role in allowing definitions of uniqueness and commonality for the data points in the input data frames.

Whenever possible, pyUpSet will try to check for (and enforce) the two above properties. In particular, when the unique_keys argument of plot is omitted, pyUpSet will try to use all columns with common names across the data frames as a list of unique keys. Under the hypotheses of homogeneity and homonymy this should be enough for all the operations carried out by pyUpSet to complete successfully.
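The two properties can be checked by hand before plotting. The sketch below uses pandas only and does not reproduce pyUpSet's internal checks; the helper names `common_columns` and `same_dtypes` and the toy frames are assumptions introduced for illustration. It builds a dictionary that maps set names to data frames, lists the columns shared by all frames (the candidates for unique keys when unique_keys is omitted), and compares their dtypes as a rough proxy for homogeneity.

```python
import pandas as pd

# Hypothetical toy frames; the column names mirror those used in the
# examples above, the values are made up.
data_dict = {
    "action": pd.DataFrame(
        {"title": ["A", "B"], "views": [10, 20], "rating_std": [0.5, 0.7]}
    ),
    "adventure": pd.DataFrame(
        {"title": ["B", "C"], "views": [20, 30], "rating_std": [0.7, 0.9]}
    ),
}


def common_columns(frames):
    """Homonymy: column names shared by every frame, in first-frame order."""
    frames = list(frames)
    shared = set(frames[0].columns)
    for df in frames[1:]:
        shared &= set(df.columns)
    return [c for c in frames[0].columns if c in shared]


def same_dtypes(frames, columns):
    """Rough homogeneity check: each shared column has one dtype overall."""
    frames = list(frames)
    return all(
        len({str(df[c].dtype) for df in frames}) == 1 for c in columns
    )


shared = common_columns(data_dict.values())
homogeneous = same_dtypes(data_dict.values(), shared)
```

Matching dtypes do not guarantee that two columns mean the same thing, so this is only a first sanity check.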

Upcoming changes

Please bear in mind that pyUpSet is under active development, so current behaviour may change at any time. In particular, here is a list of changes, in no particular order, to be expected soon:

improved OO interface for increased flexibility and customisation
improved, automated scaling of figure and axes grid according to the number of sets, intersections and additional plots (at the moment manual resizing may be needed)

Comments

Input format

Hi

I'm struggling to format my data for py-upset.

As far as I can see from the sample data, each set is a dictionary key, and the data to be compared between sets is a series corresponding to a key.

I created a dataframe with 2 columns- the first containing the set name and the second containing the corresponding strings:

I then created a dictionary with the set name as a key and the strings belonging to a given set as a series:

I ran the commands:

%matplotlib inline
import pyupset as pyu
pyu.plot(be_dict)

...and received the error: AttributeError: 'Series' object has no attribute 'columns'

Any help would be appreciated.

opened by MarlaWillemse 2
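The error message suggests that each dictionary value needs a `columns` attribute, which fits the README's description of data_dict as a dictionary of data frames. A minimal sketch of that conversion follows, assuming a two-column table like the one described in the issue; the names `long_df`, `set_name` and `item` are hypothetical.

```python
import pandas as pd

# Hypothetical two-column table: one row per (set name, item) pair.
long_df = pd.DataFrame(
    {
        "set_name": ["s1", "s1", "s2", "s2", "s2"],
        "item": ["x", "y", "y", "z", "w"],
    }
)

# One data frame per set, each with the same single column 'item'.
be_dict = {
    name: group[["item"]].reset_index(drop=True)
    for name, group in long_df.groupby("set_name")
}

# If the dictionary already holds Series, each one can be wrapped instead.
series_dict = {"s1": pd.Series(["x", "y"]), "s2": pd.Series(["y", "z", "w"])}
be_dict_from_series = {
    name: s.to_frame(name="item") for name, s in series_dict.items()
}
```

With values in this shape, a call such as pyu.plot(be_dict, unique_keys=['item']) follows the pattern shown in the README, although whether it runs may still depend on the installed pandas version.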

Add Py-Upset to UpSet.App Webpage

Hi,

I'm Alex, I'm the original developer of the first UpSet and the first-author on the UpSet paper.

I'm reaching out because I've been working on a website about UpSet and all the different applications: https://upset.app/. I'm excited about your implementation and would love to include it. In that context, I have a question and a request:

Would it be OK if I use the image I've included on this page: https://upset.app/versions/
Would you be so kind to provide the information for your implementation, based on the template I've included below. Please feel free to describe your version at the bottom. Alternatively, you could also submit a pull request here: https://github.com/visdesignlab/upset-app/tree/main/_upsetversions

Thanks!

---
layout: default
key: upset_original
name: The original UpSet
type: Interactive, Web-Based
source: https://github.com/VCG/upset
web: http://vcg.github.io/upset/
documentation: https://github.com/VCG/upset/wiki
image: upset_original.png
authors:  Alexander Lex, Nils Gehlenborg, Hendrik Strobelt, Romain Vuillemot, and Hanspeter Pfister
publication: https://vdl.sci.utah.edu/publications/2014_infovis_upset/
language: JavaScript
license: MIT License
maintained: no
interactive: yes
inline-attribute-vis: yes
attribute-views: yes
aggregation: yes
item-queries: no
set-queries: yes
shows-deviation: yes
export: no
format-table: yes
format-list: no
format-set-expression: no
---
The original UpSet, developed to go with the original paper, as an interactive web application. This version supports most advanced features. It lacks simple data upload functionality, so that it either has to be hosted locally, or pointed to a globally visible data file. Unfortunately, the original UpSet is no longer actively maintained.

The items:

inline-attribute-vis:
attribute-views:
aggregation:
item-queries:
set-queries:
shows-deviation:
format-table: yes
format-list: no
format-set-expression: no

refer to the complex features and data formats explained here https://upset.app/advanced/

opened by alexsb 0

AttributeError: 'DataFrame' object has no attribute 'ix'

Hi, I installed pyupset on Colab. I get AttributeError: 'DataFrame' object has no attribute 'ix' when trying to run a simple plot(data_dict, unique_keys=None):

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-26-a1c443ff803d> in <module>()
      7 import pyupset
      8 
----> 9 pplot= pyupset.plot(classesDict(dataset1), unique_keys = ['frame_id'])

3 frames
/usr/local/lib/python3.7/dist-packages/pyupset/visualisation.py in plot(data_dict, unique_keys, sort_by, inters_size_bounds, inters_degree_bounds, additional_plots, query)
     56     all_columns = list(all_columns)
     57 
---> 58     plot_data = DataExtractor(data_dict, all_columns)
     59     ordered_inters_sizes, ordered_in_sets, ordered_out_sets = \
     60         plot_data.get_filtered_intersections(sort_by,inters_size_bounds,inters_degree_bounds)

/usr/local/lib/python3.7/dist-packages/pyupset/visualisation.py in __init__(self, data_dict, unique_keys)
    510                                                                                             unique_keys)
    511         self.in_sets_list, self.inters_degrees, \
--> 512         self.out_sets_list, self.inters_df_dict = self.extract_intersection_data()
    513 
    514 

/usr/local/lib/python3.7/dist-packages/pyupset/visualisation.py in extract_intersection_data(self)
    569                 exclusive_intersection = exclusive_intersection.difference(pd.Index(self.df_dict[s][
    570                     self.unique_keys]))
--> 571             final_df = self.df_dict[seed].set_index(pd.Index(self.df_dict[seed][self.unique_keys])).ix[
    572                 exclusive_intersection].reset_index(drop=True)
    573             inters_dict[in_sets] = final_df

/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in __getattr__(self, name)
   5139             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5140                 return self[name]
-> 5141             return object.__getattribute__(self, name)
   5142 
   5143     def __setattr__(self, name: str, value) -> None:

opened by Gftakla 2
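The `.ix` indexer was deprecated in pandas 0.20 and removed in pandas 1.0, which explains the AttributeError on recent installations. For label-based selection like the one in the traceback, `.loc` is the documented replacement. The sketch below shows the equivalent selection on a toy frame; it is not a patch of pyUpSet, and the frame and labels are hypothetical.

```python
import pandas as pd

# Toy frame standing in for one entry of data_dict.
df = pd.DataFrame({"title": ["A", "B", "C"], "views": [10, 20, 30]})

# Labels to keep, e.g. the keys of one exclusive intersection.
wanted = pd.Index(["C", "A"])

# Old spelling, no longer available in pandas >= 1.0:
#   df.set_index("title", drop=False).ix[wanted]
# Label-based replacement:
selected = (
    df.set_index("title", drop=False)  # index rows by the unique key
    .loc[wanted]                       # select rows by label, in that order
    .reset_index(drop=True)            # return to a plain integer index
)
```

Two workarounds seem plausible, neither confirmed by the maintainers: installing a pandas version older than 1.0, or editing the line shown in the traceback locally so that `.ix` becomes `.loc`.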

python3.7 IndexError: GridSpec slice would result in no space allocated for subplot

I get the following error:

pyu.plot(data_dict, unique_keys = ['title'], sort_by='degree', inters_size_bounds=(20, 400))
/nextomics/Software/Base/miniconda3/lib/python3.7/site-packages/pyupset/visualisation.py:571: FutureWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated
  final_df = self.df_dict[seed].set_index(pd.Index(self.df_dict[seed][self.unique_keys])).ix[
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nextomics/Software/Base/miniconda3/lib/python3.7/site-packages/pyupset/visualisation.py", line 63, in plot
    upset = UpSetPlot(len(ordered_dfs), len(ordered_in_sets), additional_plots, query)
  File "/nextomics/Software/Base/miniconda3/lib/python3.7/site-packages/pyupset/visualisation.py", line 127, in __init__
    self.ax_setsize, self.ax_tablenames, self.additional_plots_axes = self._prepare_figure(additional_plots)
  File "/nextomics/Software/Base/miniconda3/lib/python3.7/site-packages/pyupset/visualisation.py", line 180, in _prepare_figure
    ax_setsize = plt.subplot(gs_top[-1:-setsize_h, 0:setsize_w])
  File "/nextomics/Software/Base/miniconda3/lib/python3.7/site-packages/matplotlib/gridspec.py", line 170, in __getitem__
    [_normalize(k1, nrows, 0), _normalize(k2, ncols, 1)],
  File "/nextomics/Software/Base/miniconda3/lib/python3.7/site-packages/matplotlib/gridspec.py", line 150, in _normalize
    raise IndexError("GridSpec slice would result in no space "
IndexError: GridSpec slice would result in no space allocated for subplot

This looks like a version incompatibility. Which versions did you use?

opened by GrandH2O 0

Data input format?

Can you please include in your readme.md how to structure incoming data? I can't see anywhere what format my data frame needs to be in, in order to render a graph. The only solution is to grab and unpickle your test data which defeats the point of your readme.md instructions.

opened by G-kodes 6

Bug? Numbers don't seem to add up

I have a set of three TSV files which I am reading as pandas.DataFrames. Because the data are being prepared for a manuscript in review, I will not share them here. I hope that my description of these files is sufficient to track down the problem.

Basically, I am looking to perform an upset of genes with significant detections of splicing QTLs between tissues. Genes can have multiple splicing QTLs associated with them (multiple splicing events, multiple genomic variants). In one such test, I observe total gene counts on the order of 5-6e4 for each of 3 tissues. However, the intersection of all 3 is on the order of 2e6. This brings into doubt the assumption that the intersections are being computed correctly. My guess is that the intersection does not properly filter for unique intersecting rows.

opened by PikalaxALT 0

Values instead of exponential

Any idea how I can show the intersection size as an exact value instead of in exponential notation?

Here I have highlighted the exponential that I'd like to be shown as a plain number.

opened by waqarali141 1

Releases(v0.1.post3)

v0.1.post3(Nov 8, 2015)

Just minor technical fixes to the distribution, no major update.

