A pure-python implementation of the UpSet suite of visualisation methods by Lex, Gehlenborg et al.

Overview

pyUpSet

Contents

Purpose

How to install

How it works

A note on the input format

Upcoming changes

Purpose

The purpose of this package is to statically reproduce some of the visualisations that can be obtained through the UpSet tool of Lex, Gehlenborg et al.

In particular, pyUpSet strengthens UpSet's focus on intersections, which motivates many of the design choices behind the exposed interface and the internal mechanics of the module. (More on this below.)

Consistent with the documentation of Lex et al.'s UpSet, the data used in the following examples come from the movie data set of GroupLens Labs.

How to install

pyUpSet is on PyPI and can therefore be installed via pip:

pip install pyupset

If you'd rather install from source, you can download and unzip the tar archive (in pyupset/dist/) and run

python setup.py install

How it works

The current interface is very simple: plots are generated through the single exposed function plot, whose arguments allow flexible customisation of the graphs. The simplest example is the basic intersection plot:

import pyupset as pyu
from pickle import load

# Load the pickled test data (a dictionary of data frames)
with open('./test_data_dict.pckl', 'rb') as f:
    data_dict = load(f)

pyu.plot(data_dict)

to produce [figure: basic plot]

N.B.: Notice that intersections are exclusive, meaning that they form a partition of the union of the base sets.
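To make the notion of exclusive intersections concrete, here is a minimal sketch using plain Python sets. It is not pyUpSet's internal code: the helper `exclusive_intersections` and the toy sets are hypothetical, chosen only to show that the exclusive intersections partition the union of the base sets.

```python
from itertools import combinations


def exclusive_intersections(sets):
    """Map each combination of set names to the items that belong to
    exactly those sets and to no other set (hypothetical helper)."""
    names = sorted(sets)
    result = {}
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            # Items shared by every set in the combination...
            inside = set.intersection(*(sets[n] for n in combo))
            # ...minus anything that also appears in a set outside it.
            others = [sets[n] for n in names if n not in combo]
            exclusive = inside - set().union(*others)
            if exclusive:
                result[combo] = exclusive
    return result


# Toy data, for illustration only.
toy_sets = {
    "action": {"A", "B", "C"},
    "adventure": {"B", "C", "D"},
    "war": {"C", "E"},
}
parts = exclusive_intersections(toy_sets)

# Every item lands in exactly one exclusive intersection,
# so the sizes add up to the size of the union.
union = set().union(*toy_sets.values())
total = sum(len(items) for items in parts.values())
```

Here item "C" is counted only in the three-way intersection, not in any of the pairwise ones, which is why `total` equals `len(union)`.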

Displayed intersections can also be filtered or sorted by size or degree:

pyu.plot(data_dict, unique_keys = ['title'], sort_by='degree', inters_size_bounds=(20, 400))

produces [figure: basic filtering]
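The effect of filtering by size and sorting by degree can be sketched on a toy table of intersection sizes. This is an illustration, not pyUpSet's implementation: the function `filter_and_sort` is hypothetical, and both the inclusive bounds and the ascending sort order are assumptions made for the example.

```python
# Toy intersection sizes, keyed by the tuple of sets each intersection belongs to.
inters_sizes = {
    ("action",): 350,
    ("adventure",): 15,
    ("action", "adventure"): 120,
    ("action", "adventure", "war"): 25,
    ("war",): 900,
}


def filter_and_sort(sizes, size_bounds=(0, float("inf")), sort_by="size"):
    """Keep intersections whose size lies within size_bounds, then sort them.

    Assumptions: bounds are inclusive and the order is ascending.
    """
    low, high = size_bounds
    kept = {k: v for k, v in sizes.items() if low <= v <= high}
    if sort_by == "degree":
        # The degree is the number of sets taking part in the intersection.
        return sorted(kept.items(), key=lambda item: len(item[0]))
    return sorted(kept.items(), key=lambda item: item[1])


selected = filter_and_sort(inters_sizes, size_bounds=(20, 400), sort_by="degree")
```

With these bounds the intersections of size 15 and 900 are dropped, and the remaining three are ordered from degree 1 to degree 3.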

The example above also uses the unique_keys kwarg, which specifies columns of the underlying data frames in data_dict that can be used to uniquely identify rows and possibly speed up the computation of intersections.

Intersection highlighting

pyUpSet supports "queries", i.e. the highlighting of intersections. Intersections to highlight are specified through tuples. For example, the following call highlights all data corresponding to movies classified as both "adventure" and "action", or as both "romance" and "war".

pyu.plot(data_dict, unique_keys = ['title'], 
         additional_plots=[{'kind':'scatter', 'data_quantities':{'x':'views', 'y':'rating_std'}},
                           {'kind':'hist', 'data_quantities':{'x':'views'}}],
         query = [('adventure', 'action'), ('romance', 'war')]
        )

[figure: simple query]

Additional plots

It is possible to add further plots that use information contained in the data frames, as in

pyu.plot(data_dict, unique_keys = ['title'], 
         additional_plots=[{'kind':'scatter', 'data_quantities':{'x':'views', 'y':'rating_std'}},
                           {'kind':'hist', 'data_quantities':{'x':'views'}}],
         query = [('adventure', 'action'), ('romance', 'war')])

This produces [figure: additional plots with query]

The highlighting produced by the queries is passed to the additional graphs. The dictionary specifying the additional graphs can also take standard matplotlib arguments as kwargs:

pyu.plot(data_dict, unique_keys = ['title'], 
        additional_plots=[{'kind':'scatter', 
                           'data_quantities':{'x':'views', 'y':'rating_std'},
                           'graph_properties':{'alpha':.8, 'lw':.4, 'edgecolor':'w', 's':50}},
                          {'kind':'hist', 
                           'data_quantities':{'x':'views'},
                           'graph_properties':{'bins':50}}], 
        query = [('adventure', 'action'), ('romance', 'war')])

yields [figure: additional plots with query and properties]

A note on the input format

pyUpSet has a very specific use case: it is focussed on the study of intersections of sets. In order for a definition of intersection to make sense, and even more for the integration of additional graphs to be meaningful, it is assumed that the input data frames have the properties of homonymy (they contain columns with the same names) and homogeneity (columns with the same name, intuitively, contain data of the same kind). While homonymy is a purely interface-dependent requirement whose aim is primarily to make pyUpSet's interface leaner, homogeneity has a functional role in allowing definitions of uniqueness and commonality for the data points in the input data frames.

Whenever possible, pyUpSet will try to check for (and enforce) the two above properties. In particular, when the unique_keys argument of plot is omitted, pyUpSet will try to use all columns with common names across the data frames as a list of unique keys. Under the hypotheses of homogeneity and homonymy this should be enough for all the operations carried out by pyUpSet to complete successfully.
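As an illustration of the expected input, the sketch below builds a small dictionary that maps set names to pandas data frames with shared column names. The data, the variable names and the two checks are assumptions made for this example; they are not taken from pyUpSet's own code.

```python
import pandas as pd

# Toy data: one data frame per set, all sharing the same column names.
action = pd.DataFrame({
    "title": ["movie_a", "movie_b", "movie_c"],
    "views": [120, 80, 45],
    "rating_std": [0.9, 1.1, 0.7],
})
romance = pd.DataFrame({
    "title": ["movie_b", "movie_d"],
    "views": [80, 30],
    "rating_std": [1.1, 0.5],
})

# Keys are set names, values are data frames (not Series).
data_dict = {"action": action, "romance": romance}

# Homonymy: the column names present in every frame.
common_columns = set.intersection(*(set(df.columns) for df in data_dict.values()))

# Homogeneity (rough check): each shared column has one dtype across frames.
homogeneous = {
    col: len({str(df[col].dtype) for df in data_dict.values()}) == 1
    for col in common_columns
}
```

A dictionary of this shape is what the examples above pass to plot, with `title` playing the role of the unique key. If the data for a set is held in a pandas Series, `Series.to_frame()` turns it into a one-column data frame.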

Upcoming changes

Please bear in mind that pyUpSet is under active development, so current behaviour may change at any time. In particular, here is a list of changes, in no particular order, to be expected soon:

  • improved OO interface for increased flexibility and customisation
  • improved, automated scaling of figure and axes grid according to the number of sets, intersections and additional plots (at the moment manual resizing may be needed)
Comments
  • Input format

    Hi

    I'm struggling to format my data for py-upset.

    As far as I can see from the sample data, each set is a dictionary key, and the data to be compared between sets is a series corresponding to a key.

    I created a dataframe with 2 columns- the first containing the set name and the second containing the corresponding strings:

    image

    I then created a dictionary with the set name as a key and the strings belonging to a given set as a series:

    image

    I ran the commands:

    %matplotlib inline
    import pyupset as pyu
    pyu.plot(be_dict)

    ...and received the error: AttributeError: 'Series' object has no attribute 'columns'

    Any help would be appreciated.

    opened by MarlaWillemse 2
  • Add Py-Upset to UpSet.App Webpage

    Hi,

    I'm Alex, I'm the original developer of the first UpSet and the first-author on the UpSet paper.

    I'm reaching out because I've been working on a website about UpSet and all the different applications: https://upset.app/. I'm excited about your implementation and would love to include it. In that context, I have a question and a request:

    • Would it be OK if I use the image I've included on this page: https://upset.app/versions/

    • Would you be so kind to provide the information for your implementation, based on the template I've included below. Please feel free to describe your version at the bottom. Alternatively, you could also submit a pull request here: https://github.com/visdesignlab/upset-app/tree/main/_upsetversions

    Thanks!

    ---
    layout: default
    key: upset_original
    name: The original UpSet
    type: Interactive, Web-Based
    source: https://github.com/VCG/upset
    web: http://vcg.github.io/upset/
    documentation: https://github.com/VCG/upset/wiki
    image: upset_original.png
    authors:  Alexander Lex, Nils Gehlenborg, Hendrik Strobelt, Romain Vuillemot, and Hanspeter Pfister
    publication: https://vdl.sci.utah.edu/publications/2014_infovis_upset/
    language: JavaScript
    license: MIT License
    maintained: no
    interactive: yes
    inline-attribute-vis: yes
    attribute-views: yes
    aggregation: yes
    item-queries: no
    set-queries: yes
    shows-deviation: yes
    export: no
    format-table: yes
    format-list: no
    format-set-expression: no
    ---
    The original UpSet, developed to go with the original paper, as an interactive web application. This version supports most advanced features. It lacks simple data upload functionality, so that it either has to be hosted locally, or pointed to a globally visible data file. Unfortunately, the original UpSet is no longer actively maintained.  
    

    The items:

    inline-attribute-vis:
    attribute-views:
    aggregation:
    item-queries:
    set-queries:
    shows-deviation:
    format-table: yes
    format-list: no
    format-set-expression: no
    

    refer to the complex features and data formats explained here https://upset.app/advanced/

    opened by alexsb 0
  • AttributeError: 'DataFrame' object has no attribute 'ix'

    Hi, I installed pyupset on COLAB. I get AttributeError: 'DataFrame' object has no attribute 'ix' when trying to run simple plot(data_dict, unique_keys=None)

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-26-a1c443ff803d> in <module>()
          7 import pyupset
          8 
    ----> 9 pplot= pyupset.plot(classesDict(dataset1), unique_keys = ['frame_id'])
    
    3 frames
    /usr/local/lib/python3.7/dist-packages/pyupset/visualisation.py in plot(data_dict, unique_keys, sort_by, inters_size_bounds, inters_degree_bounds, additional_plots, query)
         56     all_columns = list(all_columns)
         57 
    ---> 58     plot_data = DataExtractor(data_dict, all_columns)
         59     ordered_inters_sizes, ordered_in_sets, ordered_out_sets = \
         60         plot_data.get_filtered_intersections(sort_by,inters_size_bounds,inters_degree_bounds)
    
    /usr/local/lib/python3.7/dist-packages/pyupset/visualisation.py in __init__(self, data_dict, unique_keys)
        510                                                                                             unique_keys)
        511         self.in_sets_list, self.inters_degrees, \
    --> 512         self.out_sets_list, self.inters_df_dict = self.extract_intersection_data()
        513 
        514 
    
    /usr/local/lib/python3.7/dist-packages/pyupset/visualisation.py in extract_intersection_data(self)
        569                 exclusive_intersection = exclusive_intersection.difference(pd.Index(self.df_dict[s][
        570                     self.unique_keys]))
    --> 571             final_df = self.df_dict[seed].set_index(pd.Index(self.df_dict[seed][self.unique_keys])).ix[
        572                 exclusive_intersection].reset_index(drop=True)
        573             inters_dict[in_sets] = final_df
    
    /usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in __getattr__(self, name)
       5139             if self._info_axis._can_hold_identifiers_and_holds_name(name):
       5140                 return self[name]
    -> 5141             return object.__getattribute__(self, name)
       5142 
       5143     def __setattr__(self, name: str, value) -> None:
    
    opened by Gftakla 2
  • python3.7    IndexError: GridSpec slice would result in no space allocated for subplot

    I get the following error:

    pyu.plot(data_dict, unique_keys = ['title'], sort_by='degree', inters_size_bounds=(20, 400))
    /nextomics/Software/Base/miniconda3/lib/python3.7/site-packages/pyupset/visualisation.py:571: FutureWarning: 
    .ix is deprecated. Please use
    .loc for label based indexing or
    .iloc for positional indexing
    
    See the documentation here:
    http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated
      final_df = self.df_dict[seed].set_index(pd.Index(self.df_dict[seed][self.unique_keys])).ix[
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/nextomics/Software/Base/miniconda3/lib/python3.7/site-packages/pyupset/visualisation.py", line 63, in plot
        upset = UpSetPlot(len(ordered_dfs), len(ordered_in_sets), additional_plots, query)
      File "/nextomics/Software/Base/miniconda3/lib/python3.7/site-packages/pyupset/visualisation.py", line 127, in __init__
        self.ax_setsize, self.ax_tablenames, self.additional_plots_axes = self._prepare_figure(additional_plots)
      File "/nextomics/Software/Base/miniconda3/lib/python3.7/site-packages/pyupset/visualisation.py", line 180, in _prepare_figure
        ax_setsize = plt.subplot(gs_top[-1:-setsize_h, 0:setsize_w])
      File "/nextomics/Software/Base/miniconda3/lib/python3.7/site-packages/matplotlib/gridspec.py", line 170, in __getitem__
        [_normalize(k1, nrows, 0), _normalize(k2, ncols, 1)],
      File "/nextomics/Software/Base/miniconda3/lib/python3.7/site-packages/matplotlib/gridspec.py", line 150, in _normalize
        raise IndexError("GridSpec slice would result in no space "
    IndexError: GridSpec slice would result in no space allocated for subplot
    

    This looks like a version incompatibility. Which versions did you use?

    opened by GrandH2O 0
  • Data input format?

    Can you please include in your readme.md how to structure incoming data? I can't see anywhere what format my data frame needs to be in, in order to render a graph. The only solution is to grab and unpickle your test data which defeats the point of your readme.md instructions.

    opened by G-kodes 6
  • Bug? Numbers don't seem to add up

    I have a set of three TSV files which I am reading as pandas.DataFrames. Because the data are being prepared for a manuscript in review, I will not share them here. I hope that my description of these files is sufficient to track down the problem.

    Basically, I am looking to perform an upset of genes with significant detections of splicing QTLs between tissues. Genes can have multiple splicing QTLs associated with them (multiple splicing events, multiple genomic variants). In one such test, I observe total gene counts on the order of 5-6e4 for each of 3 tissues. However, the intersection of all 3 is on the order of 2e6. This brings into doubt the assumption that the intersections are being computed correctly. My guess is that the intersection does not properly filter for unique intersecting rows.

    opened by PikalaxALT 0
  • Values instead of exponential

    Any idea how I can show the intersection size as an exact value instead of in exponential notation?

    Here I have highlighted the exponential value that I'd like to be shown as a plain number: [image]

    opened by waqarali141 1
Releases(v0.1.post3)