CleanX is an open source python library for exploring, cleaning and augmenting large datasets of X-rays, or certain other types of radiological images.


JPEG files can be extracted from DICOM files or used directly.


primary author: Candace Makeda H. Moore

other authors + contributors: Oleg Sivokon, Andrew Murphy


Requirements

  • a python installation (3.7, 3.8 or 3.9)
  • ability to create virtual environments (recommended, not absolutely necessary)
  • tesserocr, matplotlib, pandas, pillow and opencv
  • optional, but recommended: SimpleITK or pydicom for DICOM (dcm) to JPG conversion
  • Anaconda is now supported, but not technically necessary
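
To see in advance which of these dependencies can be imported in the current environment, a small check along the following lines may help. This is only a sketch: the helper missing_modules is hypothetical and not part of cleanX, and the import names (PIL for pillow, cv2 for opencv) are assumed to be the usual ones for those packages.

```python
import importlib.util

# Import names for the packages listed above (assumption: the usual
# import names, e.g. pillow is imported as PIL and opencv as cv2).
REQUIRED = ["tesserocr", "matplotlib", "pandas", "PIL", "cv2"]
OPTIONAL = ["SimpleITK", "pydicom"]


def missing_modules(names):
    """Return the names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("missing required:", missing_modules(REQUIRED))
print("missing optional:", missing_modules(OPTIONAL))
```

If either SimpleITK or pydicom is reported as missing, DICOM extraction will likely be unavailable, while the image-processing parts may still work.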

Developer's Guide

Please refer to the Developer's Guide for a more detailed explanation.

Developing Using Anaconda's Python

Use Git to check out the project's source, then, in the source directory run:

conda create -n cleanx
conda activate cleanx
python ./setup.py install_dev

You may have to do this for Python 3.7, Python 3.8 and Python 3.9 if you need to check that your changes will work in all supported versions.

Developing Using python.org's Python

Use Git to check out the project's source, then in the source directory run:

python -m venv .venv
. ./.venv/bin/activate
python ./setup.py install_dev

As with the conda-based setup, you may have to create three separate environments, using Python 3.7, 3.8 and 3.9, to recreate our CI process.

Supported Platforms

cleanX package is a pure Python package, but it has many dependencies on native libraries. We try to test it on as many platforms as we can to see if dependencies can be installed there. Below is the list of platforms that will potentially work.

Wherever python.org Python or Anaconda Python is listed as supported, this means versions 3.7, 3.8 and 3.9. We know for certain that 3.6 is not supported, and there will be no support for it in the future.
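
A script that depends on cleanX could guard against unsupported interpreters with a check like the one below. This is an illustrative sketch; the function is_supported is a hypothetical helper, not something cleanX provides.

```python
import sys

# Minor versions named as supported in this README.
SUPPORTED = {(3, 7), (3, 8), (3, 9)}


def is_supported(version_info=None):
    """Tell whether a (major, minor, ...) version is in the supported set."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in SUPPORTED


if not is_supported():
    print("Warning: this Python version is outside the tested range")
```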

32-bit Intel and ARM

We don't know if either one of these is supported. There's a good chance that 32-bit Intel will work. There's a good chance that ARM won't.

It's unlikely that the support will be added in the future.

AMD64 (x86)

                   Linux      Win        OSX
python.org Python  Supported  Unknown    Unknown
Anaconda Python    Supported  Supported  Supported

ARM64

Seems to be unsupported at the moment on both Linux and OSX, but it's likely that support will be added in the future.

Documentation

Online documentation at https://drcandacemakedamoore.github.io/cleanX/

You can also build up-to-date documentation yourself with the following commands:

python setup.py apidoc
python setup.py build_sphinx

The documentation will be generated in ./build/sphinx/html directory. Documentation is generated automatically as new functions are added.

Special additional documentation for medical professionals with limited programming ability is available on the wiki (https://github.com/drcandacemakedamoore/cleanX/wiki/Medical-professional-documentation).

To get a high level overview of some of the functionality of the program you can look at the Jupyter notebooks inside workflow_demo.

Installation

  • setting up a virtual environment is desirable, but not absolutely necessary

  • activate the environment

Anaconda Installation

  • use command for conda as below
conda install -c doctormakeda -c conda-forge cleanx

You need to specify both channels because there are some cleanX dependencies that exist in both Anaconda main channel and in conda-forge

pip installation

  • use pip as below
pip install cleanX

Getting Started

We will imagine a very simple scenario, where we need to automate normalization of the images we have. We stored the images in directory /images/to/clean/ and they all have jpg extension. We want the cleaned images to be saved in the cleaned directory.

Normalization here means ensuring that the lowest pixel value (the darkest part of the image) is as dark as possible and that the lightest part of the image is as light as possible.
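
The idea can be illustrated with a simple linear (min-max) stretch. The sketch below is not cleanX's implementation (the library offers several normalization methods); it uses plain nested lists so that it runs without any dependencies, and the function name stretch_contrast is made up for this example.

```python
def stretch_contrast(rows, lowest=0, highest=255):
    """Rescale pixels so the darkest becomes `lowest`, the brightest `highest`."""
    flat = [value for row in rows for value in row]
    darkest, brightest = min(flat), max(flat)
    if darkest == brightest:
        # A completely flat image has no contrast to stretch.
        return [[lowest for _ in row] for row in rows]
    span = brightest - darkest
    return [
        [round(lowest + (value - darkest) * (highest - lowest) / span) for value in row]
        for row in rows
    ]


# A tiny 2x2 "image" whose values only cover the range 50..200.
image = [[50, 100], [150, 200]]
stretched = stretch_contrast(image)
print(stretched)  # [[0, 85], [170, 255]]
```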

CLI Example

The problem above doesn't require writing any new Python code. We can accomplish our task by calling the cleanX command like this:

mkdir cleaned

python -m cleanX images run-pipeline \
    -s Acquire \
    -s Normalize \
    -s "Save(target='cleaned')" \
    -j \
    -r "/images/to/clean/*.jpg"

Let's look at the command's options and arguments:

  • python -m cleanX uses Python's -m command-line option to load the cleanX package. All command-line arguments that follow this part are interpreted by cleanX.
  • images sub-command is used for processing of images.
  • run-pipeline sub-command is used to start a Pipeline to process the images.
  • -s (repeatable) option specifies Pipeline Step. Steps map to their class names as found in the cleanX.image_work.steps module. If the __init__ function of a step doesn't take any arguments, only the class name is necessary. If, however, it takes arguments, they must be given using Python's literals, using Python's named arguments syntax.
  • -j option instructs cleanX to create a journaling pipeline. Journaling pipelines can be restarted from the point where they failed or were interrupted.
  • -r option specifies the source for the pipeline. While we will normally want to start with the Acquire step, if the pipeline was interrupted, we need to tell it where to look for the initial sources.
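
To make the -s syntax more concrete, here is one way a step specification such as "Save(target='cleaned')" could be split into a class name and keyword arguments. This is a sketch using the standard ast module; cleanX's actual parsing may work differently, and parse_step is a hypothetical name.

```python
import ast


def parse_step(spec):
    """Split a step specification into (class name, keyword arguments)."""
    node = ast.parse(spec.strip(), mode="eval").body
    if isinstance(node, ast.Name):
        # Bare class name, e.g. "Normalize": no arguments.
        return node.id, {}
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        if node.args or any(kw.arg is None for kw in node.keywords):
            raise ValueError("only named arguments are accepted: " + spec)
        # literal_eval accepts Python literals only, never arbitrary code.
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        return node.func.id, kwargs
    raise ValueError("not a step specification: " + spec)


print(parse_step("Normalize"))               # ('Normalize', {})
print(parse_step("Save(target='cleaned')"))  # ('Save', {'target': 'cleaned'})
```

Restricting argument values to literals keeps the command line from being able to run arbitrary Python expressions.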

Once the command finishes, we should see the cleaned directory filled with images with the same names they had in the source directory.
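
The relation between source and destination names can be pictured as follows. The sketch only computes the mapping on throwaway files and does not touch cleanX; destination_paths is a hypothetical helper.

```python
import glob
import os
import tempfile


def destination_paths(pattern, target):
    """Map every file matching `pattern` to a same-named path in `target`."""
    return {
        source: os.path.join(target, os.path.basename(source))
        for source in sorted(glob.glob(pattern))
    }


# Demonstration on empty placeholder files instead of real X-rays.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.jpg", "b.jpg", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    mapping = destination_paths(os.path.join(tmp, "*.jpg"), "cleaned")
    planned = sorted(mapping.values())

print(planned)  # only the two .jpg files, now under "cleaned"
```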

Let's consider another simple task: batch-extraction of images from DICOM files:


mkdir extracted

python -m cleanX dicom extract \
    -i dir /path/to/dicoms/ \
    -o extracted

This calls the cleanX CLI in a way similar to the example above; however, it calls the dicom sub-command with the extract subcommand.

  • -i tells cleanX to look in the directory named /path/to/dicoms
  • -o tells cleanX to save the extracted JPGs in the extracted directory.

If you have any problems with this, check #40 and add issues or discussions.

Coding Example

Below is the equivalent code in Python:

import os

from cleanX.image_work import (
    Acquire,
    Save,
    GlobSource,
    Normalize,
    create_pipeline,
)

dst = 'cleaned'
os.makedirs(dst, exist_ok=True)  # does not fail if the directory already exists

src = GlobSource('/images/to/clean/*.jpg')
p = create_pipeline(
    steps=(
        Acquire(),
        Normalize(),
        Save(dst),
    ),
    journal=True,
)

p.process(src)

Let's look at what's going on here. As before, we've created a pipeline using create_pipeline with three steps: Acquire, Normalize and Save. There are several kinds of sources available for pipelines. We'll use the GlobSource to match our CLI example. We'll specify journal=True to match the -j flag in our CLI example.
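
What journaling buys you can be shown without cleanX at all. The following sketch records finished items in a JSON file and skips them on the next run; it illustrates the general idea only and makes no claim about how cleanX stores its journal. The function process_with_journal and the journal file layout are assumptions made for this example.

```python
import json
import os
import tempfile


def process_with_journal(items, work, journal_path):
    """Apply `work` to each item, skipping items already recorded as done."""
    done = set()
    if os.path.exists(journal_path):
        with open(journal_path) as handle:
            done = set(json.load(handle))
    for item in items:
        if item in done:
            continue
        work(item)
        done.add(item)
        # Persist progress after every item so an interruption loses nothing.
        with open(journal_path, "w") as handle:
            json.dump(sorted(done), handle)
    return done


names = ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]
seen = []


def fail_on_c(name):
    if name == "c.jpg":
        raise RuntimeError("simulated interruption")
    seen.append(name)


with tempfile.TemporaryDirectory() as tmp:
    journal = os.path.join(tmp, "journal.json")
    try:
        process_with_journal(names, fail_on_c, journal)  # stops at c.jpg
    except RuntimeError:
        pass
    process_with_journal(names, seen.append, journal)    # resumes at c.jpg

print(seen)  # ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg']
```

Each file is handled exactly once across the two runs, which is the behaviour the -j flag is meant to provide for long-running jobs.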


And for the DICOM extraction we might use similar code:

import os

from cleanX.dicom_processing import DicomReader, DirectorySource

dst = 'extracted'
os.makedirs(dst, exist_ok=True)  # does not fail if the directory already exists

reader = DicomReader()
reader.rip_out_jpgs(DirectorySource('/path/to/dicoms/', 'file'), dst)

This will look for files with the dcm extension in /path/to/dicoms/ and try to extract the images found in those files, saving them in the extracted directory.
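
Before running a batch extraction it can be useful to check which files in a directory really are DICOM files. Standard DICOM files start with a 128-byte preamble followed by the four bytes DICM, which the sketch below tests for. The helpers looks_like_dicom and find_dicoms are hypothetical and not part of cleanX; some older files lack the preamble and would be missed by this check.

```python
import os
import tempfile
from pathlib import Path


def looks_like_dicom(path):
    """Check for the 'DICM' marker that follows the 128-byte preamble."""
    with open(path, "rb") as handle:
        handle.seek(128)
        return handle.read(4) == b"DICM"


def find_dicoms(directory):
    """Return *.dcm files in `directory` that carry the DICOM marker."""
    return [p for p in sorted(Path(directory).glob("*.dcm")) if looks_like_dicom(p)]


# Demonstration with synthetic files rather than real patient data.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "real.dcm"), "wb") as handle:
        handle.write(b"\x00" * 128 + b"DICM")
    with open(os.path.join(tmp, "fake.dcm"), "wb") as handle:
        handle.write(b"not a dicom file")
    found = [p.name for p in find_dicoms(tmp)]

print(found)  # ['real.dcm']
```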

About using this library

If you use the library, please credit me and my collaborators. You are only free to use this library according to its license. We hope that if you use the library you will open source your entire code base, and send us modifications. If you have a legitimate reason to use my library without open-sourcing your code base, or without following other conditions of the license, you can get in touch with me by starting a discussion (https://github.com/drcandacemakedamoore/cleanX/discussions/37), and I can issue you a different license.

We are adding new functions and classes all the time. Many unit tests are available in the test folder. Test coverage is currently partial. Some newly added functions allow for rapid automated data augmentation (in ways that are realistic for radiological data). Some other classes and functions are for cleaning datasets including ones that:

  • Get image and metadata out of dcm (DICOM) files into jpeg and csv files
  • Process datasets from csv or json or other formats to generate reports
  • Run on dataframes to make sure there is no image leakage
  • Run on a dataframe to look for demographic or other biases in patients
  • Crop off excessive black frames (run this on single images, one at a time)
  • Run on a list to make a prototype tiny Xray others can be compared to
  • Run on image files which are inside a folder to check if they are "clean"
  • Take a dataframe with image names and return plotted (visualized) images
  • Run to make a dataframe of pics in a folder (assuming they all have the same 'label'/diagnosis)
  • Normalize images in terms of pixel values (multiple methods)
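
As an illustration of one item in this list, checking for image leakage can be thought of as looking for identical images that appear in both a training and a test set. The sketch below does this by hashing raw bytes with the standard library; it is an assumption about what such a check involves, not cleanX's own function, and find_leaks is a hypothetical name.

```python
import hashlib


def find_leaks(train_images, test_images):
    """Return (train name, test name) pairs whose contents are identical.

    Both arguments map an image name to its raw bytes.
    """
    digests = {}
    for name, data in train_images.items():
        digests.setdefault(hashlib.sha256(data).hexdigest(), []).append(name)
    leaks = []
    for name, data in test_images.items():
        for twin in digests.get(hashlib.sha256(data).hexdigest(), []):
            leaks.append((twin, name))
    return leaks


# Tiny stand-ins for image files; the second training image reappears in test.
train = {"train/p1.jpg": b"\x00\x10\x20", "train/p2.jpg": b"\x05\x06\x07"}
test = {"test/q1.jpg": b"\x05\x06\x07", "test/q2.jpg": b"\xff\xfe"}
leaks = find_leaks(train, test)
print(leaks)  # [('train/p2.jpg', 'test/q1.jpg')]
```

Hashing only catches byte-identical copies; re-encoded or resized duplicates need a pixel-level comparison.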

All important functions are documented in the online documentation for programmers. You can also check out one of our videos by clicking the linked picture below:

Video

Comments
  • Joss issues

    This is a list of some improvements/suggestions or issues that may need clarifications.

    • [x] Is this file needed GNU GENERAL PUBLIC LICENSE.txt?

    • [x] Include Conda badges https://anaconda.org/doctormakeda/cleanx/badges

    • [x] Make sure that the test badges link to the test builds. Currently, they link to the image of the badge.

    • [x] Create a paper folder for the paper files and include a copy of the LICENSE file.

    • [x] Include some examples on how to get started in the readme file. The same applies to the documentation. I would expect at least some sort of getting started guide.

    • [x] Since version v0.1.9 was released, I would expect the current changes in development to have v0.2.0.dev as the version, later to be released as v0.2.0. But if you prefer to keep the current pattern, that's fine.

    • [x] Move all document files to a docs folder. I think readthedocs could also enable the docs to have two versions, the stable and the latest.

    • [x] In the Jupyter notebooks we have paths like 'D:/projects/cleanX'. It would be nice to start by getting the current project's directory and then use relative paths with join. For example:

    dicomfile_directory1 = 'D:/projects/cleanX/test/dicom_example_folder'
    example = pd.read_csv("D:/projects/cleanX/workflow_demo/martians_2051.csv")
    # To
    working_dir = "Path to project home"
    example_path = os.path.normpath(os.path.join(working_dir, "workflow_demo/martians_2051.csv"))
    example = pd.read_csv(example_path)
    

    It would be nice to normalize the paths. This will help Windows users who have a hard time with / and \ characters.

    opened by henrykironde 20
  • Examples and workflow_demo

    @drcandacemakedamoore 👍🏿 for getting this to finally install smoothly. Some issues that I have are detailed below.

    README.md Example:

    • [ ] Add a check to see if the path cleaned exists, or always delete it first and then make a new one.
    dst = 'cleaned'
    if not  os.path.exists(dst):
        os.mkdir(dst)
    
    dst = 'cleaned'
    os.rmdir(dst)
    os.mkdir(dst)
    

    Improve this README.md example. I had to install SimpleITK and PyDICOM. You could add these to the required dependencies.

    (cleanx) henrysenyondo ~/Downloads/cleanX [main] $ python examplecleanX.py 
    WARNING:root:Don't know how to find Tesseract library version
    /Users/henrysenyondo/Downloads/cleanX/cleanX/dicom_processing/__init__.py:37: UserWarning: 
    Neither SimpleITK nor PyDICOM are installed.
    
    Will not be able to extract information from DICOM files.
    
      warnings.warn(
    Traceback (most recent call last):
      File "examplecleanX.py", line 36, in <module>
        from cleanX.dicom_processing import DicomReader
    ImportError: cannot import name 'DicomReader' from 'cleanX.dicom_processing' (/Users/henrysenyondo/Downloads/cleanX/cleanX/dicom_processing/__init__.py)
    (cleanx) henrysenyondo ~/Downloads/cleanX [main] $
    
    

    Use a path that actually exists in the repo: src = GlobSource('/images/to/clean/*.jpg')

    workflow_demo examples:

    • [ ] Use paths that exist in the repo, or add a comment pointing to the data to be used in the given example. Assume that the user is going to run the examples in the root directory, so all paths could be relative to that directory, as in the example from cleanX/workflow_demo/classes_workflow.ipynb
    • [ ] Refactor the workflow_demo files, rename them appropriately, and remove files that are not needed.
    opened by henrykironde 17
  • pip install cleanx, on mac errors

    Describe the bug No package 'tesseract' found

    Screenshots

    Using legacy 'setup.py install' for tesserocr, since package 'wheel' is not installed.
    Installing collected packages: tesserocr, opencv-python, matplotlib, cleanX
        Running setup.py install for tesserocr ... error
        ERROR: Command errored out with exit status 1:
         command: /Users/henrykironde/Documents/GitHub/testenv/bin/python3.9 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/d5/vm7jfw2x7q550xrltd8ss5440000gn/T/pip-install-d2ki3n8m/tesserocr_9008e8f2373142109467f5269239caa5/setup.py'"'"'; __file__='"'"'/private/var/folders/d5/vm7jfw2x7q550xrltd8ss5440000gn/T/pip-install-d2ki3n8m/tesserocr_9008e8f2373142109467f5269239caa5/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/d5/vm7jfw2x7q550xrltd8ss5440000gn/T/pip-record-zx09fpub/install-record.txt --single-version-externally-managed --compile --install-headers /Users/henrykironde/Documents/GitHub/testenv/include/site/python3.9/tesserocr
             cwd: /private/var/folders/d5/vm7jfw2x7q550xrltd8ss5440000gn/T/pip-install-d2ki3n8m/tesserocr_9008e8f2373142109467f5269239caa5/
        Complete output (20 lines):
        pkg-config failed to find tesseract/leptonica libraries: Package tesseract was not found in the pkg-config search path.
        Perhaps you should add the directory containing `tesseract.pc'
        to the PKG_CONFIG_PATH environment variable
        No package 'tesseract' found
        
        Failed to extract tesseract version from executable: [Errno 2] No such file or directory: 'tesseract'
        Supporting tesseract v3.04.00
        Tesseract major version 3
        Building with configs: {'libraries': ['tesseract', 'lept'], 'compile_time_env': {'TESSERACT_MAJOR_VERSION': 3, 'TESSERACT_VERSION': 50593792}}
        WARNING: The wheel package is not available.
        running install
        running build
        running build_ext
        Detected compiler: unix
        building 'tesserocr' extension
        creating build
        creating build/temp.macosx-11-x86_64-3.9
        clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -I/usr/local/opt/[email protected]/include -I/usr/local/opt/sqlite/include -I/Users/henry/Documents/GitHub/testenv/include -I/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.9/include/python3.9 -c tesserocr.cpp -o build/temp.macosx-11-x86_64-3.9/tesserocr.o
        clang: error: invalid version number in 'MACOSX_DEPLOYMENT_TARGET=11'
        error: command '/usr/bin/clang' failed with exit code 1
        ----------------------------------------
    ERROR: Command errored out with exit status 1: /Users/henrykironde/Documents/GitHub/testenv/bin/python3.9 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/d5/vm7jfw2x7q550xrltd8ss5440000gn/T/pip-install-d2ki3n8m/tesserocr_9008e8f2373142109467f5269239caa5/setup.py'"'"'; __file__='"'"'/private/var/folders/d5/vm7jfw2x7q550xrltd8ss5440000gn/T/pip-install-d2ki3n8m/tesserocr_9008e8f2373142109467f5269239caa5/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/d5/vm7jfw2x7q550xrltd8ss5440000gn/T/pip-record-zx09fpub/install-record.txt --single-version-externally-managed --compile --install-headers /Users/henrykironde/Documents/GitHub/testenv/include/site/python3.9/tesserocr Check the logs for full command output.
    WARNING: You are using pip version 21.1.3; however, version 21.2.4 is available.
    
    (testenv) ➜  cleanX git:(docs) ✗ 
    

    Your computer environment info: (please complete the following information):

    OS: [MacOSX]
    Python V=version [3.9]
    
    opened by henrykironde 7
  • Language cleanup and typos

    CleanX uses some sensitive language that may offend some users. I would recommend that you remove words like idiots since it is against the code of conduct for Joss.

    There are typos in the doc strings, like """This class allows normalization by throwing off exxtreme values on" It would be nice to look through the doc strings and try to remove the typos.

    Note: I am still failing to install CleanX, but I think it is some complications with my Conda setup. I will keep you updated. My target is to finish with the review and final decision in 14 days.

    Ref: openjournals/joss-reviews#3632

    opened by henrykironde 5
  • wrong version of zero_to_twofivefive_simplest_norming()

    We seem to have put in an older (with a small bug) version of the zero_to_twofivefive_simplest_norming(). All image normalization functions should be tested and updated tonight (24/1/2022)

    opened by drcandacemakedamoore 3
  • Suggestions

    Can you add documentation in the following files?

    • journaline_pipeline.py, starting from line 110
    • steps.py starting from line 112
    • Many functions in the files dataframes.py, pydicom_adapter.py, and simpleitk_adapter.py
    opened by sbonaretti 3
  • Dependency

    Create a report to help us improve

    Describe the bug tesserocr

    To Reproduce Steps to reproduce the behavior:

    pip install cleanx
    

    Expected behavior A clear and concise description of what you expected to happen. ERROR: Failed building wheel for tesserocr Running setup.py clean for tesserocr

    Screenshots If applicable, add screenshots to help explain your problem.

    Your computer environment info: (please complete the following information): Ubuntu 16.

    OS: [e.g. Linux]
    Python V=version [e.g. 3.7]
    

    I think you should add the minimum requirements to the readme file

    opened by delwende 3
  • Testing builds on Windows and Mac

    It would be nice if the builds were tested on Windows and Mac. One can do that using GitHub actions: https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs#example-adding-configurations https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idruns-on

    opened by fdiblen 2
  • fix image comparison, probably with numpy allclose() function

    Image comparison for copies function is too slow and memory intensive at present. Maybe we can implement something with the numpy library that is faster.

    opened by drcandacemakedamoore 2
  • [Security] Workflow on-tag.yml is using vulnerable action s-weigand/setup-conda

    The workflow on-tag.yml references the action s-weigand/setup-conda using the reference v1. However, this reference is missing the commit a30654e576ab9e21a25825bf7a5d5f2a9b95b202, which may contain a fix for a vulnerability. The missing fix could be related to: (1) a CVE fix, (2) an upgrade of a vulnerable dependency, (3) a fix for a secret leak, or others. Please consider updating the reference to the action.

    opened by fockboi-lgtm 2
  • Clutter in documentation

    In retrospect, using https://www.sphinx-doc.org/en/master/man/sphinx-apidoc.html was a bad idea. The code it generates is awful and impossible to control. In particular, there's no way to disable or enable special methods on per-class basis. Similarly for inheritance etc.

    Apparently, we need to replace this with something else that would generate sensible documentation pages. There's no hope that sphinx-apidoc will ever improve.

    opened by wvxvw 2
  • color normalizer- after JOSS review finishes

    Some of our users are applying this to color images, i.e. endoscopic images. This is by chance, and it could have been pathology images. We should add functions explicitly for this, starting with finding color outliers. I will attack this once the JOSS review completes.

    opened by drcandacemakedamoore 0
Releases(v0.1.14)
Owner
Candace Makeda Moore, MD
Python, SQL, Javascript, and HTML. I love imaging informatics.