cleanX

License: GPL-3

CleanX is an open source python library for exploring, cleaning and augmenting large datasets of X-rays, or certain other types of radiological images. JPEG files can be extracted from DICOM files or used directly.

primary author: Candace Makeda H. Moore

other authors + contributors: Oleg Sivokon, Andrew Murphy

Requirements

  • a python installation (3.7, 3.8 or 3.9)
  • ability to create virtual environments (recommended, not absolutely necessary)
  • tesserocr, matplotlib, pandas, pillow and opencv
  • optional recommendation of SimpleITK or pydicom for DICOM/dcm to JPG conversion
  • Anaconda is now supported, but not technically necessary
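Before installing, it can help to see which of the packages above are already importable. The following is a minimal sketch that uses only the standard library; the helper name missing_modules is hypothetical and not part of cleanX. It relies on the usual import names, which differ from the package names for pillow (PIL) and opencv (cv2).

```python
import importlib.util

# Import names for the packages listed above. Note that pillow is imported
# as "PIL" and opencv as "cv2".
REQUIRED = ["tesserocr", "matplotlib", "pandas", "PIL", "cv2"]
# Either one of these is enough for DICOM to JPG conversion.
DICOM_BACKENDS = ["SimpleITK", "pydicom"]


def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing required packages:", ", ".join(missing))
    if len(missing_modules(DICOM_BACKENDS)) == len(DICOM_BACKENDS):
        print("No DICOM backend found: install SimpleITK or pydicom.")
```

The check only tells you whether a module can be located; it does not verify versions or that native libraries such as Tesseract are installed.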

Developer's Guide

Please refer to the Developer's Guide for a more detailed explanation.

Developing Using Anaconda's Python

Use Git to check out the project's source, then, in the source directory run:

conda create -n cleanx
conda activate cleanx
python ./setup.py install_dev

You may have to do this for Python 3.7, Python 3.8 and Python 3.9 if you need to check that your changes will work in all supported versions.

Developing Using python.org's Python

Use Git to check out the project's source, then in the source directory run:

python -m venv .venv
. ./.venv/bin/activate
python ./setup.py install_dev

Similar to conda based setup, you may have to use Python versions 3.7, 3.8 and 3.9 to create three different environments to recreate our CI process.

Supported Platforms

cleanX package is a pure Python package, but it has many dependencies on native libraries. We try to test it on as many platforms as we can to see if dependencies can be installed there. Below is the list of platforms that will potentially work.

Wherever python.org Python or Anaconda Python is listed as supported, this means that versions 3.7, 3.8 and 3.9 are supported. We know for certain that 3.6 is not supported, and there will be no support for it in the future.
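To check whether the interpreter you are running falls in the supported range, something like the following could be used. This is only an illustrative sketch: the helper is_supported is hypothetical, and cleanX itself may check versions differently or not at all.

```python
import sys

# Versions named as supported in this README.
SUPPORTED = {(3, 7), (3, 8), (3, 9)}


def is_supported(version_info=None):
    """Return True if the (major, minor) version is in the supported set."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in SUPPORTED


if __name__ == "__main__":
    print("Supported Python version:", is_supported())
```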

32-bit Intel and ARM

We don't know if either one of these is supported. There's a good chance that 32-bit Intel will work. There's a good chance that ARM won't.

It's unlikely that the support will be added in the future.

AMD64 (x86)

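|                   | Linux     | Win       | OSX       |
|-------------------|-----------|-----------|-----------|
| python.org Python | Supported | Unknown   | Unknown   |
| Anaconda Python   | Supported | Supported | Supported |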

ARM64

Seems to be unsupported at the moment on both Linux and OSX, but it's likely that support will be added in the future.

Documentation

Online documentation at https://drcandacemakedamoore.github.io/cleanX/

You can also build up-to-date documentation locally with the following commands:

python setup.py apidoc
python setup.py build_sphinx

The documentation will be generated in ./build/sphinx/html directory. Documentation is generated automatically as new functions are added.

Special additional documentation for medical professionals with limited programming ability is available on the wiki (https://github.com/drcandacemakedamoore/cleanX/wiki/Medical-professional-documentation).

To get a high level overview of some of the functionality of the program you can look at the Jupyter notebooks inside workflow_demo.

Installation

  • setting up a virtual environment is desirable, but not absolutely necessary

  • activate the environment

Anaconda Installation

  • use command for conda as below
conda install -c doctormakeda -c conda-forge cleanx

You need to specify both channels because there are some cleanX dependencies that exist in both Anaconda main channel and in conda-forge

pip installation

  • use pip as below
pip install cleanX

Getting Started

We will imagine a very simple scenario, where we need to automate normalization of the images we have. We stored the images in directory /images/to/clean/ and they all have jpg extension. We want the cleaned images to be saved in the cleaned directory.

Normalization here means ensuring that the lowest pixel value (the darkest part of the image) is as dark as possible and that the lightest part of the image is as light as possible.
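To make the idea concrete, here is a minimal sketch of this kind of contrast stretching on a tiny image stored as a plain list of rows. It is an illustration of the concept only, not the code behind the library's Normalize step; the function name normalize and the 0 to 255 output range are assumptions.

```python
def normalize(pixels, low=0, high=255):
    """Linearly stretch a 2D list of pixel values to span [low, high]."""
    flat = [value for row in pixels for value in row]
    darkest, lightest = min(flat), max(flat)
    if darkest == lightest:
        # A flat image has no contrast to stretch; map everything to `low`.
        return [[low for _ in row] for row in pixels]
    span = lightest - darkest
    return [
        [round(low + (value - darkest) * (high - low) / span) for value in row]
        for row in pixels
    ]


if __name__ == "__main__":
    image = [[50, 100], [150, 200]]
    print(normalize(image))
```

After the stretch, the darkest pixel of the input maps to 0 and the lightest to 255, with everything else scaled proportionally in between.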

CLI Example

The problem above doesn't require writing any new Python code. We can accomplish our task by calling the cleanX command like this:

mkdir cleaned

python -m cleanX images run-pipeline \
    -s Acquire \
    -s Normalize \
    -s "Save(target='cleaned')" \
    -j \
    -r "/images/to/clean/*.jpg"

Let's look at the command's options and arguments:

  • python -m cleanX uses Python's -m command-line option to load the cleanX package. All command-line arguments that follow this part are interpreted by cleanX.
  • The images sub-command is used for processing images.
  • The run-pipeline sub-command is used to start a Pipeline to process the images.
  • The -s (repeatable) option specifies a Pipeline Step. Steps map to their class names as found in the cleanX.image_work.steps module. If the __init__ function of a step doesn't take any arguments, only the class name is necessary. If, however, it takes arguments, they must be given as Python literals, using Python's named arguments syntax.
  • The -j option instructs cleanX to create a journaling pipeline. Journaling pipelines can be restarted from the point where they failed or were interrupted.
  • The -r option specifies the source for the pipeline. While we will normally want to start with the Acquire step, if the pipeline was interrupted we need to tell it where to look for the initial sources.
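The step syntax accepted by -s (a bare class name, or a class name with named literal arguments) can be illustrated with the standard ast module. The sketch below shows one way such a string could be split into a name and keyword arguments; it is an assumption for illustration, not how cleanX parses its command line, and parse_step is a hypothetical helper.

```python
import ast


def parse_step(spec):
    """Split a step spec such as "Save(target='cleaned')" into (name, kwargs)."""
    node = ast.parse(spec, mode="eval").body
    if isinstance(node, ast.Name):
        # Bare class name, e.g. "Normalize": no arguments.
        return node.id, {}
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        if node.args:
            raise ValueError("only named arguments are supported in this sketch")
        # literal_eval accepts only Python literals, so no code is executed.
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        return node.func.id, kwargs
    raise ValueError(f"not a valid step spec: {spec!r}")


if __name__ == "__main__":
    for spec in ("Acquire", "Normalize", "Save(target='cleaned')"):
        print(parse_step(spec))
```

Restricting argument values to literals is what makes it safe to accept such strings from the command line: nothing in the spec is ever evaluated as arbitrary code.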

Once the command finishes, we should see the cleaned directory filled with images with the same names they had in the source directory.

Let's consider another simple task: batch-extraction of images from DICOM files:


mkdir extracted

python -m cleanX dicom extract \
    -i dir /path/to/dicoms/ \
    -o extracted

This calls the cleanX CLI in a way similar to the example above; however, it calls the dicom sub-command with its extract sub-command.

  • -i tells cleanX to look for directory named /path/to/dicoms
  • -o tells cleanX to save extracted JPGs in extracted directory.

If you have any problems with this, check #40 and open an issue or start a discussion.

Coding Example

Below is the equivalent code in Python:

import os

from cleanX.image_work import (
    Acquire,
    Save,
    GlobSource,
    Normalize,
    create_pipeline,
)

dst = 'cleaned'
os.makedirs(dst, exist_ok=True)  # do not fail if the directory already exists

src = GlobSource('/images/to/clean/*.jpg')
p = create_pipeline(
    steps=(
        Acquire(),
        Normalize(),
        Save(dst),
    ),
    journal=True,
)

p.process(src)

Let's look at what's going on here. As before, we've created a pipeline using create_pipeline with three steps: Acquire, Normalize and Save. There are several kinds of sources available for pipelines. We'll use the GlobSource to match our CLI example. We'll specify journal=True to match the -j flag in our CLI example.
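The idea behind journaling (being able to restart where a run stopped) can be shown with a toy example. The sketch below is not cleanX's journaling pipeline: the function process_with_journal and the JSON journal file are assumptions made for illustration. It records each finished item so that a second run skips work that was already done.

```python
import json
import os


def process_with_journal(items, step, journal_path):
    """Apply `step` to each item, skipping items recorded in the journal."""
    done = set()
    if os.path.exists(journal_path):
        with open(journal_path) as f:
            done = set(json.load(f))
    for item in items:
        if item in done:
            continue  # finished in an earlier run
        step(item)
        done.add(item)
        # Persist progress after every item so an interruption loses little.
        with open(journal_path, "w") as f:
            json.dump(sorted(done), f)
    return done
```

If step raises for some item, the journal still lists everything completed before it, so calling the function again with the same journal_path resumes from the failed item.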


And for the DICOM extraction we might use similar code:

import os

from cleanX.dicom_processing import DicomReader, DirectorySource

dst = 'extracted'
os.makedirs(dst, exist_ok=True)  # do not fail if the directory already exists

reader = DicomReader()
reader.rip_out_jpgs(DirectorySource('/path/to/dicoms/', 'file'), dst)

This will look for the files with dcm extension in /path/to/dicoms/ and try to extract images found in those files, saving them in extracted directory.
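If you want to preview which files would be considered before running the extraction, a small standard-library sketch such as the one below can list them. The helper find_dicom_files is hypothetical; it assumes a non-recursive search and a case-insensitive match on the dcm extension, which may differ from what DirectorySource actually does.

```python
from pathlib import Path


def find_dicom_files(directory):
    """Return sorted paths of files directly in `directory` ending in .dcm."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == ".dcm"
    )


if __name__ == "__main__":
    # "." is only a placeholder; point this at your own DICOM directory.
    for path in find_dicom_files("."):
        print(path)
```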

About using this library

If you use the library, please credit me and my collaborators. You are only free to use this library according to the license. We hope that if you use the library you will open source your entire code base, and send us modifications. If you have a legitimate reason to use my library without open-sourcing your code base or following other conditions, you can get in touch with me by starting a discussion (https://github.com/drcandacemakedamoore/cleanX/discussions/37), and I can make you a specific, different license.

We are adding new functions and classes all the time. Many unit tests are available in the test folder. Test coverage is currently partial. Some newly added functions allow for rapid automated data augmentation (in ways that are realistic for radiological data). Some other classes and functions are for cleaning datasets including ones that:

  • Get image and metadata out of dcm (DICOM) files into jpeg and csv files
  • Process datasets from csv or json or other formats to generate reports
  • Run on dataframes to make sure there is no image leakage
  • Run on a dataframe to look for demographic or other biases in patients
  • Crop off excessive black frames (run this on single images) one at a time
  • Run on a list to make a prototype tiny Xray others can be compared to
  • Run on image files which are inside a folder to check if they are "clean"
  • Take a dataframe with image names and return plotted (visualized) images
  • Run to make a dataframe of pics in a folder (assuming they all have the same 'label'/diagnosis)
  • Normalize images in terms of pixel values (multiple methods)
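As an illustration of what an image leakage check looks for, the sketch below finds images that appear byte-for-byte in both a training set and a test set. It is a simplified stand-in, not the library's dataframe-based functions: find_leaked_images is a hypothetical helper that takes two dictionaries mapping image names to raw bytes and compares SHA-256 digests.

```python
import hashlib


def find_leaked_images(train, test):
    """Return (train_name, test_name) pairs whose contents are byte-identical.

    `train` and `test` map image names to the raw bytes of each image.
    """
    train_by_digest = {}
    for name, data in train.items():
        digest = hashlib.sha256(data).hexdigest()
        train_by_digest.setdefault(digest, []).append(name)
    leaks = []
    for name, data in test.items():
        digest = hashlib.sha256(data).hexdigest()
        for train_name in train_by_digest.get(digest, []):
            leaks.append((train_name, name))
    return sorted(leaks)


if __name__ == "__main__":
    train = {"a.jpg": b"\x00\x01", "b.jpg": b"\x02"}
    test = {"c.jpg": b"\x00\x01", "d.jpg": b"\x03"}
    print(find_leaked_images(train, test))
```

Exact hashing only catches identical files; re-encoded or resized copies of the same X-ray would need a pixel-level or perceptual comparison instead.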

All important functions are documented in the online documentation for programmers. You can also check out one of our videos by clicking the linked picture below:

Video

Comments
  • Joss issues


    This is a list of some improvements/suggestions or issues that may need clarifications.

    • [x] Is this file needed GNU GENERAL PUBLIC LICENSE.txt?

    • [x] Include Conda badges https://anaconda.org/doctormakeda/cleanx/badges

    • [x] Make sure that the test badges link to the test builds. Currently, they link to the image of the badge.

    • [x] Create a paper folder for the paper files and include a copy of the LICENSE file.

    • [x] Include some examples on how to get started in the readme file. The same applies to the documentation. I would expect at least some sort of getting started guide.

    • [x] Since version v0.1.9 was released, I would expect the current changes to have v0.2.0.dev as the version for these changes in development, later to be released as v0.2.0. But if you prefer to keep the current pattern, that's fine.

    • [x] Move all document files to a docs folder. I think readthedocs could also enable the docs have two versions, the stable and the latest.

    • [x] In the Jupyter we have paths like 'D:/projects/cleanX' It would be nice to start by getting the current project's directory and then use relative paths with join. For example:

    dicomfile_directory1 = 'D:/projects/cleanX/test/dicom_example_folder'
    example = pd.read_csv("D:/projects/cleanX/workflow_demo/martians_2051.csv")
    # To
    working_dir = "Path to project home"
    example_path = os.path.normpath(os.path.join(working_dir, "workflow_demo/martians_2051.csv"))
    example = pd.read_csv(example_path)
    

    It would be nice to normalize the paths. This will help Windows users who have a hard time with / and \ characters

    opened by henrykironde 20
  • Examples and workflow_demo


    @drcandacemakedamoore 👍🏿 for getting this to finally install smoothly. Some issues that I have are detailed below.

    README.md Example:

    • [ ] Add a check to see if the path cleaned exists, or always delete it first and then make a new one.
    dst = 'cleaned'
    if not  os.path.exists(dst):
        os.mkdir(dst)
    
    dst = 'cleaned'
    os.rmdir(dst)
    os.mkdir(dst)
    

    Improve this README.md example, I had to install SimpleITK and PyDICOM. You could add this to required dependencies.

    (cleanx) henrysenyondo ~/Downloads/cleanX [main] $ python examplecleanX.py 
    WARNING:root:Don't know how to find Tesseract library version
    /Users/henrysenyondo/Downloads/cleanX/cleanX/dicom_processing/__init__.py:37: UserWarning: 
    Neither SimpleITK nor PyDICOM are installed.
    
    Will not be able to extract information from DICOM files.
    
      warnings.warn(
    Traceback (most recent call last):
      File "examplecleanX.py", line 36, in <module>
        from cleanX.dicom_processing import DicomReader
    ImportError: cannot import name 'DicomReader' from 'cleanX.dicom_processing' (/Users/henrysenyondo/Downloads/cleanX/cleanX/dicom_processing/__init__.py)
    (cleanx) henrysenyondo ~/Downloads/cleanX [main] $
    
    

    Use a path that does actually exist in the repo src = GlobSource('/images/to/clean/*.jpg')

    workflow_demo examples:

    • [ ] Use paths that do exist in the repo, or add a comment to point to the data to be used in that given example. Assume that the user is going to run the examples in the root directory, so all paths could be relative to that directory. In the example from cleanX/workflow_demo/classes_workflow.ipynb
    • [ ] Refactor the workflow_demo files, rename them appropriately remove files not needed.
    opened by henrykironde 17
  • pip install cleanx, on mac errors


    Describe the bug No package 'tesseract' found

    Screenshots

    Using legacy 'setup.py install' for tesserocr, since package 'wheel' is not installed.
    Installing collected packages: tesserocr, opencv-python, matplotlib, cleanX
        Running setup.py install for tesserocr ... error
        ERROR: Command errored out with exit status 1:
         command: /Users/henrykironde/Documents/GitHub/testenv/bin/python3.9 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/d5/vm7jfw2x7q550xrltd8ss5440000gn/T/pip-install-d2ki3n8m/tesserocr_9008e8f2373142109467f5269239caa5/setup.py'"'"'; __file__='"'"'/private/var/folders/d5/vm7jfw2x7q550xrltd8ss5440000gn/T/pip-install-d2ki3n8m/tesserocr_9008e8f2373142109467f5269239caa5/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/d5/vm7jfw2x7q550xrltd8ss5440000gn/T/pip-record-zx09fpub/install-record.txt --single-version-externally-managed --compile --install-headers /Users/henrykironde/Documents/GitHub/testenv/include/site/python3.9/tesserocr
             cwd: /private/var/folders/d5/vm7jfw2x7q550xrltd8ss5440000gn/T/pip-install-d2ki3n8m/tesserocr_9008e8f2373142109467f5269239caa5/
        Complete output (20 lines):
        pkg-config failed to find tesseract/leptonica libraries: Package tesseract was not found in the pkg-config search path.
        Perhaps you should add the directory containing `tesseract.pc'
        to the PKG_CONFIG_PATH environment variable
        No package 'tesseract' found
        
        Failed to extract tesseract version from executable: [Errno 2] No such file or directory: 'tesseract'
        Supporting tesseract v3.04.00
        Tesseract major version 3
        Building with configs: {'libraries': ['tesseract', 'lept'], 'compile_time_env': {'TESSERACT_MAJOR_VERSION': 3, 'TESSERACT_VERSION': 50593792}}
        WARNING: The wheel package is not available.
        running install
        running build
        running build_ext
        Detected compiler: unix
        building 'tesserocr' extension
        creating build
        creating build/temp.macosx-11-x86_64-3.9
        clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -I/usr/local/opt/[email protected]/include -I/usr/local/opt/sqlite/include -I/Users/henry/Documents/GitHub/testenv/include -I/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.9/include/python3.9 -c tesserocr.cpp -o build/temp.macosx-11-x86_64-3.9/tesserocr.o
        clang: error: invalid version number in 'MACOSX_DEPLOYMENT_TARGET=11'
        error: command '/usr/bin/clang' failed with exit code 1
        ----------------------------------------
    ERROR: Command errored out with exit status 1: /Users/henrykironde/Documents/GitHub/testenv/bin/python3.9 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/d5/vm7jfw2x7q550xrltd8ss5440000gn/T/pip-install-d2ki3n8m/tesserocr_9008e8f2373142109467f5269239caa5/setup.py'"'"'; __file__='"'"'/private/var/folders/d5/vm7jfw2x7q550xrltd8ss5440000gn/T/pip-install-d2ki3n8m/tesserocr_9008e8f2373142109467f5269239caa5/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/d5/vm7jfw2x7q550xrltd8ss5440000gn/T/pip-record-zx09fpub/install-record.txt --single-version-externally-managed --compile --install-headers /Users/henrykironde/Documents/GitHub/testenv/include/site/python3.9/tesserocr Check the logs for full command output.
    WARNING: You are using pip version 21.1.3; however, version 21.2.4 is available.
    
    (testenv) ➜  cleanX git:(docs) ✗ 
    

    Your computer environment info: (please complete the following information):

    OS: [MacOSX]
    Python V=version [3.9]
    
    opened by henrykironde 7
  • Language cleanup and typos


    CleanX uses some sensitive language that may offend some users. I would recommend that you remove words like idiots since it is against the code of conduct for Joss.

    There are typos in the doc strings, like """This class allows normalization by throwing off exxtreme values on" It would be nice to look through the doc strings and try to remove the typos.

    Note: I am still failing to install CleanX, but I think it is some complications with my Conda setup. I will keep you updated. My target is to finish with the review and final decision in 14 days.

    Ref: openjournals/joss-reviews#3632

    opened by henrykironde 5
  • wrong version of zero_to_twofivefive_simplest_norming()


    We seem to have put in an older (with a small bug) version of the zero_to_twofivefive_simplest_norming(). All image normalization functions should be tested and updated tonight (24/1/2022)

    opened by drcandacemakedamoore 3
  • Suggestions


    Can you add documentation in the following files?

    • journaline_pipeline.py, starting from line 110
    • steps.py starting from line 112
    • Many functions in the files dataframes.py, pydicom_adapter.py, and simpleitk_adapter.py
    opened by sbonaretti 3
  • Dependency


    Create a report to help us improve

    Describe the bug tesserocr

    To Reproduce Steps to reproduce the behavior:

    pip install cleanx
    

    Expected behavior A clear and concise description of what you expected to happen. ERROR: Failed building wheel for tesserocr Running setup.py clean for tesserocr

    Screenshots If applicable, add screenshots to help explain your problem.

    Your computer environment info: (please complete the following information): Ubuntu 16.

    OS: [e.g. Linux]
    Python V=version [e.g. 3.7]
    

    I think you should add minimum requirement in the readme file

    opened by delwende 3
  • Testing builds on Windows and Mac


    It would be nice if the builds were tested on Windows and Mac. One can do that using GitHub actions: https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs#example-adding-configurations https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idruns-on

    opened by fdiblen 2
  • fix image comparison, probably with numpy allclose() function


    Image comparison for copies function is too slow and memory intensive at present. Maybe we can implement something with the numpy library that is faster.
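    A rough sketch of what a NumPy-based comparison along these lines might look like is given below. It is not the library's implementation; images_match is a hypothetical helper, and the inputs are assumed to be arrays (or nested lists) of pixel values.

```python
import numpy as np


def images_match(a, b, atol=0):
    """Return True if two images have the same shape and matching pixels.

    With atol=0 this is an exact comparison; a positive atol allows each
    pixel to differ by up to that amount.
    """
    a = np.asarray(a)
    b = np.asarray(b)
    if a.shape != b.shape:
        # Different sizes can never be copies; skip the pixel comparison.
        return False
    if atol == 0:
        return bool(np.array_equal(a, b))
    # Cast to float so unsigned integer pixels cannot wrap around.
    return bool(np.allclose(a.astype(np.float64), b.astype(np.float64),
                            rtol=0, atol=atol))
```

    Checking shapes first is cheap and rules out most non-copies before any pixel data is compared.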

    opened by drcandacemakedamoore 2
  • [Security] Workflow on-tag.yml is using vulnerable action s-weigand/setup-conda


    The workflow on-tag.yml references the action s-weigand/setup-conda using the reference v1. However, this reference is missing the commit a30654e576ab9e21a25825bf7a5d5f2a9b95b202, which may contain a fix for a vulnerability. The missing fix could be related to: (1) a CVE fix, (2) an upgrade of a vulnerable dependency, (3) a fix for a secret leak, or others. Please consider updating the reference to the action.

    opened by fockboi-lgtm 2
  • Clutter in documentation


    In retrospect, using https://www.sphinx-doc.org/en/master/man/sphinx-apidoc.html was a bad idea. The code it generates is awful and impossible to control. In particular, there's no way to disable or enable special methods on per-class basis. Similarly for inheritance etc.

    Apparently, we need to replace this with something else that would generate sensible documentation pages. There's no hope that sphinx-apidoc will ever improve.

    opened by wvxvw 2
  • color normalizer- after JOSS review finishes


    Some of our users are applying this to color images, e.g. endoscopic images. This is by chance, and it could have been pathology images. We should add functions explicitly for this, starting with finding color outliers. I will attack this once the JOSS review completes.

    opened by drcandacemakedamoore 0
Releases(v0.1.14)