HyperSpy is an open source Python library for the interactive analysis of multidimensional datasets

Overview

HyperSpy is an open source Python library for the interactive analysis of multidimensional datasets that can be described as multidimensional arrays of a given signal (for example, a 2D array of spectra, also known as a spectrum image).

HyperSpy makes it straightforward to apply analytical procedures that operate on an individual signal to multidimensional arrays, as well as providing easy access to analytical tools that exploit the multidimensionality of the dataset.

Its modular structure makes it easy to add features to analyze many different types of signals.

HyperSpy is released under the GPL v3 license.

Since version 0.8.4, HyperSpy only supports Python 3. If you need to install HyperSpy in Python 2.7, please install version 0.8.3.

Contributing

Everyone is welcome to contribute. Please read our contributing guidelines and get started!

Comments
  • Making separate IO-library

    As discussed in https://github.com/hyperspy/hyperspy/issues/1599, the plan is to make a separate IO-library which contains the functionality to read and write various data formats.

    This library should ideally only depend on the "standard" scientific python stack (numpy, scipy, ...), to facilitate reuse in other projects. It should therefore return standard python objects, which the file readers currently do.

    To make this transition easier @ericpre suggested doing this for a new file format in 1.4, to test the new setup (https://github.com/hyperspy/hyperspy/issues/1599#issuecomment-388555703). A possible candidate for this could be Merlin binary (mib) format, which currently has 3(?) different implementations.

    Things to figure out:

    • What name should the library have? (emfilepy?)
    • Where should it be located? At least initially it should be located in the hyperspy github group
    • Is the Merlin binary format a good place to start?
    • How to handle out-of-core (lazy) loading? Rely on dask or return a numpy memmap object?
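One way to picture the reader interface discussed above, as a minimal sketch: a hypothetical `file_reader` for a made-up headerless raw format returns a list of plain dictionaries and uses `numpy.memmap` for the out-of-core case. The function signature, the dictionary keys and the file layout are assumptions for illustration, not the agreed design.

```python
import os
import tempfile

import numpy as np


def file_reader(filename, shape, dtype="uint16", lazy=False):
    """Hypothetical reader for a headerless raw file (illustration only)."""
    if lazy:
        # Out-of-core: map the file instead of reading it into memory.
        data = np.memmap(filename, dtype=dtype, mode="r", shape=shape)
    else:
        data = np.fromfile(filename, dtype=dtype).reshape(shape)
    # Only standard Python/NumPy objects are returned.
    return [{"data": data, "axes": [{"size": n} for n in shape], "metadata": {}}]


# Write a small test file, then read it back both ways.
path = os.path.join(tempfile.mkdtemp(), "frames.raw")
np.arange(24, dtype="uint16").tofile(path)

eager = file_reader(path, shape=(2, 3, 4))[0]
lazy = file_reader(path, shape=(2, 3, 4), lazy=True)[0]
```

Returning a memmap keeps the library free of a dask dependency; a caller that wants dask could still wrap the memmap itself.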
    release: next major 
    opened by magnunor 114
  • EELS DB Website Integration

    Hi there,

    I'm the web developer behind a recent redevelopment of the EELS Database website. The old site (http://pc-web.cemes.fr/eelsdb/) has been around for years and serves as a public repository for EELS spectra.

    The new site (http://eelsdb.eu) has been re-written from scratch and is currently in beta, due to go live soon. The main objective was to open up the site a lot more - both to make data submission far simpler, but also to make it easier to browse and access the data. There are currently ~220 spectra on the site and we hope it will grow significantly after launch.

    A recent addition to the site is a new API. It's currently very simple - it allows spectra metadata to be retrieved (with a link to the raw data file), and browsed / searched. It could be extended to handle other features such as data uploads and comments (which would presumably require authentication of some kind). Data is currently returned as JSON.

    I'm not a physicist myself, so I'm totally out of my depth when it comes to the science of these spectra (I'm a bioinformatician in my day job). However, Luc Lajaunie, who is handling the project said that integration between the site and HyperSpy could be cool! Certainly it would be nice to lower the boundaries for deposition and retrieval of data on the archive in any way possible. For users of HyperSpy it could be an opportunity to easily access a large database of spectra for comparison / use.

    Let me know if you guys on the project would be interested in any such integration. Probably a good start would be to have a look at the beta site (I think you can register yourselves, or drop me a mail), then if you're keen we can chat about the scope of any work.

    Phil

    opened by ewels 102
  • Would anybody be interested in the bruker's bcf format read-only functionality

    I think I have studied Bruker's *.bcf files enough (I spent half a week in hex) to figure out how they work. How many people would be interested in importing data straight from bcf files? Would the authors of this project be interested, or does this contradict the aims of the project (because bcf is a proprietary format)? I am going to write the library anyway (in the European Union we have the right to reverse-engineer in order to open our own data), and I hope it will be useful to this project (by the way, I have some code for reading Bruker's EDS...). However, I am not very experienced (I am a geologist in the first place and programming is just a hobby); I am average in Python and I do not know C or C++. How many people would be interested, would it be useful for HyperSpy, and could I get a helping hand or some advice?

    opened by sem-geologist 93
  • ENH-FIX: change distutils to setuptools and add flexible cythonization

    this code fixes two things:

    1. changes distutils to setuptools
    2. adds the flexible cythonization.

    The first is straightforward, but the second needs discussion. First, let's consider these points:

    • The user can obtain hyperspy in a few different ways: git, binary distribution, source distribution.
    • It is highly recommended by the cython documentation to distribute the C files generated from cython, as this makes cython not required during installation and prevents undefined behavior resulting from different versions of cython.
    • cython-generated C files are huge and complicated, contain a lot of boilerplate, and are ugly and hardly readable. It is no surprise that they are not very welcome in the git source tree.
    • At this moment there is one library pending (bcf io_plugin) which has some functions implemented in cython, but alternative python code is also present, so even in the worst case the lack of cython should not be an obstacle to installation.

    Considering the above, the setup should meet these criteria:

    • do not have a hard dependency on cython
    • automatically check whether the *.pyx files have c files with the same name; if not:
      • try to import cythonize from Cython and use it on the pyx files
      • if cython is not present on the system, the user should be informed and presented with alternatives, but the installation should not be terminated: without the cython code hyperspy is still fully functional, only a bit slower in the few places where the cython code would be missing.
    • if cythonized c files are present, do not cythonize the code, unless the new "recythonize" command is used: python setup.py recythonize. ~~special user option are added to python setup.py install --force-cythonization (or to develop which will be used for testing env)~~
    • for development purposes, if the source contains '.git', generate a post-checkout hook which removes only the cythonized '.c/.cpp' and compiled '.so' or '.pyd' files.
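The decision logic in the criteria above can be sketched roughly as follows. This is not the code from the PR: `plan_cythonization`, `cython_available` and `describe` are hypothetical helper names, and the sketch only decides what to do, it does not call Cython.

```python
import importlib.util
import tempfile
from pathlib import Path


def plan_cythonization(pyx_files, force=False):
    """Split .pyx files into those to (re)cythonize and those already built.

    A file is cythonized when `force` is set (the "recythonize" case) or
    when no generated .c/.cpp file sits next to it.
    """
    to_cythonize, prebuilt = [], []
    for pyx in map(Path, pyx_files):
        has_c = any(pyx.with_suffix(ext).exists() for ext in (".c", ".cpp"))
        (to_cythonize if force or not has_c else prebuilt).append(pyx)
    return to_cythonize, prebuilt


def cython_available():
    """True if Cython can be imported; it stays an optional dependency."""
    return importlib.util.find_spec("Cython") is not None


def describe(pyx_files, force=False):
    to_cythonize, _ = plan_cythonization(pyx_files, force)
    if not to_cythonize:
        return "use shipped C files"
    if cython_available():
        return "cythonize"
    # No hard failure: the pure-Python fallback keeps the install working.
    return "warn and continue without compiled extensions"


# Demo tree: a.pyx ships with a.c, b.pyx does not.
tmp = Path(tempfile.mkdtemp())
for name in ("a.pyx", "a.c", "b.pyx"):
    (tmp / name).touch()
todo, prebuilt = plan_cythonization([tmp / "a.pyx", tmp / "b.pyx"])
forced, _ = plan_cythonization([tmp / "a.pyx", tmp / "b.pyx"], force=True)
```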

    TODO:

    • [x] add trial to cythonize extensions if c/c++ files are missing
    • [x] add new command "recythonize" to be used with setup.py to recythonize all extensions
    • [x] update CI and anaconda configs
    • [x] add git hook to automatically recythonize and rebuild --inplace after git checkout
    • [x] add travis ci testing for osx
    • [x] add bdist_wheels deployment to release from travis ci (for osx) and sdist (for linux and other)
    • [x] test travis release works
    • [x] add bdist_wheels deployment to release from appveyor
    • [ ] test bdist_wheels deployment to release works from appveyor (it does not deploy but it builds and preserves the artefacts at appveyor)
    • [x] test python setup.py install works without compiler present (ubuntu live usb: passed; appveyor: too much interfered setup; windows 7 with fresh winpython + traits(from gohlke's place): passed)
    type: new feature 
    opened by sem-geologist 91
  • NEW: bruker's composite file (bcf) hypermap io (read-only) library

    The library with reader of bruker composite file (BCF) hypermaps and images.

    TO DO:

    • [x] header parser, SEM imagery, sum EDS spectra to numpy
    • [x] pure python parser functions of elemental mapping to numpy (fully functioning, but memory inefficient)
    • [x] pure python parser with dynamic loading of file and optional dynamic zlib decompression (required to enable cython based parser)
    • [x] the hyperspy abstract/mapped functions of library (file_reader)
    • [x] cython based implementation of Delphi/Bruker packed array fast parsing
    • [x] modify the setup.py, to compile cython code
    • [x] some minimal loading tests
    • [x] integration to hyperspy tests
    type: new feature 
    opened by sem-geologist 86
  • Added support for complex data

    As discussed in #787 I added the two classes for hologram and wave images. As stated there, HologramImage is nearly empty and WaveImage has some basic functionality (convenience properties and a few functions). There are a lot of TODOs in the code which in part hold a few things I could still do, for example making the functions work with image stacks. For this I would like to ask what the proper implementation for iterating over the images in the stack in HyperSpy would be. Any ideas? Other TODOs include names of colleagues of mine (@FWin22 and @woozey), whose tasks it would be (after a successful merge) to fill the two classes with life :-)! I also added a few basic tests. I know this PR is in an early state, but maybe you could give me input what could be improved.


    I'll track the required course of action here:

    • [x] Merge #1074 adding the skimage dependency (this should be merged first, because both other PRs implement this as well).
    • [x] Merge #1089 for ufunc functionality.
    • [x] Merge this PR containing complex functionality.
    • [ ] Merge #1079 containing the HologramImage class, testing and documentation, maintained by @woozey. This should be merged last, because it depends on both prior PRs.
    type: new feature 
    opened by jan-car 79
  • Enh lazy signal

    [resubmit of #1102 ] Adds a new class of "Lazy" signals that only actually operate on data (and even access it) when required or explicitly told to do so.

    The intended workflow is as follows:

    1. Load the data lazily.
    2. Perform calculations (will most likely be fast / instantaneous) or operations
    3. Save the result signal to a file. Then the operations are performed by dask, and results written to file.

    One of the drawbacks is that the dask.array.Array (which this whole PR relies upon) does not support slice assignment (i.e. a[3] += 2 does not work; one has to do something like a = da.concatenate([a[:3], a[3:4] + 2, a[4:]])), so not everything works as seamlessly.
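The workaround can be sketched as follows. The pieces are passed to `concatenate` as one list, and the modified element is taken as a length-1 slice so that all pieces have the same number of dimensions. The NumPy fallback is only a stand-in so the sketch runs without dask installed.

```python
try:
    import dask.array as xp  # the lazy case described above
except ImportError:
    import numpy as xp  # stand-in so the sketch still runs without dask

import numpy as np

a = xp.arange(6)
# Equivalent of `a[3] += 2` without item assignment:
# keep a[:3], rebuild element 3 as a length-1 slice, keep a[4:].
a = xp.concatenate([a[:3], a[3:4] + 2, a[4:]])
result = np.asarray(a)  # forces the computation for a dask array
```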

    Also, lazy signals mean plotting is slow (since the data is calculated on request), but data of any size should be feasible.

    NEW things

    • Lazy*Signal classes: mirror the normal signal classes, but perform the operations lazily.
    • nan*() methods: just like numpy.nanmax and others. Previously were missing...
    • "ONMF" decomposition: Online NMF implementation as per this paper by Zhao et al. In particular it's the OPGD implementation in the paper. Not yet thoroughly tested.
    • as_lazy() method: All signals now have as_lazy() method, which does the obvious. Will probably be used to convert any other signal into its lazy implementation.
    • compute() method: Lazy signals can be converted to conventional ones using this method. The better (and more realistic) workflow saves the end lazy signal to a file (since it presumably does not fit in memory all at once).

    Behaviour Changes

    • CircleROI for LazySignal sets the elements outside the ROI to np.nan instead of using a masked array, since dask does not support masked arrays. Essentially the motivating factor for nan* methods.
    • Lazy loading: deprecated all memmap, mmap, load_to_memory and similar kwargs, and instead left only the new lazy kwarg. This means the user does not have to pay attention to the format of the file and is still able to load it lazily, as this is handled by the format reader.
    • Stacking: largely rewritten, can now perform the stacking lazily. Also supports stacking numbers (floats, integers, complex numbers), which are broadcast as appropriate.
    • get_histogram: lazy implementations do not support 'knuth' and 'blocks' bins.
    • ragged argument in map: the ragged argument is required for lazy signals, but optional (i.e. can be determined automatically while running) for normal ones. If ragged, the results of the map are not assumed to be of similar shape or even numpy.arrays, and can be any python object.
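The NaN-instead-of-mask behaviour of the ROI, and why the nan* reductions are needed with it, can be illustrated with plain NumPy. This is a minimal sketch on a small in-memory array, not the CircleROI implementation.

```python
import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)

# Boolean mask of a circle of radius 1.5 centred on pixel (2, 2).
yy, xx = np.ogrid[:5, :5]
inside = (yy - 2) ** 2 + (xx - 2) ** 2 <= 1.5 ** 2

# Instead of a masked array, blank out everything outside the ROI.
roi = np.where(inside, image, np.nan)

plain_max = np.max(roi)     # NaN: ordinary reductions are poisoned by NaN
roi_max = np.nanmax(roi)    # ignores the NaN padding
roi_mean = np.nanmean(roi)
```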

    Notable code changes (relevant for developers)

    • Rewrote a bunch of iterating algorithms to use signal.map, since it's generalized and works for both conventional and lazy signals.
    • misc.signal_tools.broadcast_signals added. Used in *nary functions, so no need for new tests.
    • lazifyTestClass decorator for test classes. Creates new test_lazy_* methods from all existing test_* methods, where any signal in the class is cast as lazy. Allows reusing most of the regular tests for lazy signals. Can overwrite other class attributes as well.
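The idea behind such a decorator can be sketched as below. This is a simplified, hypothetical version (`lazify_test_class` is not the PR's code): it flips a `lazy` flag on the test instance where the real decorator would cast the signals held by the class to their lazy counterparts.

```python
def lazify_test_class(cls):
    """Hypothetical sketch: add a test_lazy_* twin for every test_* method."""
    for name, method in list(vars(cls).items()):
        if name.startswith("test_") and callable(method):
            def lazy_test(self, _method=method):
                self.lazy = True  # stand-in for casting signals to lazy
                try:
                    return _method(self)
                finally:
                    self.lazy = False
            lazy_test.__name__ = name.replace("test_", "test_lazy_", 1)
            setattr(cls, lazy_test.__name__, lazy_test)
    return cls


@lazify_test_class
class TestSum:
    lazy = False

    def test_sum(self):
        return ("lazy" if self.lazy else "eager", sum([1, 2, 3]))


t = TestSum()
```

Binding `method` through the default argument `_method=method` avoids the usual late-binding pitfall of closures created in a loop.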

    TODO:

    • [x] stacking
    • [x] most existing readers can load lazily
    • [x] lazy/iterating decomposition tests (mainly O(R)NMF)
    • [x] bcf lazy loading
    • [x] developer guide update
    • [x] "Big Data" chapter in the user guide
    • [x] Holography signals
    type: new feature 
    opened by to266 78
  • NEW: Implementation of plot_images utility function

    Resolves #452.

    This PR can be used as a starting point for the plot_images function. The majority of the code has been lifted from the plot_decomposition_loadings() method, and adapted so I could plot whichever images I wanted.

    HOLDS:

    • [x] Waiting on #160 before merging

    FOR DEBATE:

    • [x] debate merits of _deepcopy
    • [x] ~~change default tight_layout ?~~

    TODO:

    • [x] add testing notebook to hyperspy demos
    • [x] ~~let x and y axes be undefined and remove if the case~~
    • [x] ~~look into perc with vmin/vmax in image.plot()~~
    • [x] ~~replace example of EDS data with data from pburdet~~
    • [x] ~~remove print statements from image.plot() code~~
    • [x] ~~implement (or not) for 3D RGB images~~
    • [x] ~~docstring like numpy style~~
    • [x] ~~raise ValueError for incorrect scalebar input~~
    • [x] ~~investigate reshaping~~
    • [x] ~~pep8 compliance~~
    • [x] ~~move imports to beginning of file~~
    • [x] ~~Documentation in user guide~~
    • [x] ~~improve labeling of similar signals~~
    • [x] ~~Prevent plot_images() from causing iteration of prior calls to image.plot()~~
    • [x] ~~If images have different scales, they get squeezed. The aspect of plt.imshow should be optimized as it is in image.plot()~~
    • [x] ~~refactor plot_scalebar as scalebar (None, 'all', or list of int choosing which plots should have the scalebar)~~
    • [x] ~~one option for axes (None, 'default' (which is ticks and labels), 'ticks' (no labels))~~
    • [x] ~~Handling of input if a Signal instance is supplied instead of a list (like the output of get_decomposition_loadings())~~ This actually already worked because Signal is iterable (I think), so I just added some checks to make sure it's a Signal if not a list. I'd be happy for suggestions of improvement on this part.
    • [x] ~~Adding option of fixing the colormap so it is shared between all images, rather than having independent ones for all the individual images~~
    • [x] ~~Plot scale bar option~~
    • [x] ~~Plot axes option~~
    • [x] ~~Pass extra args and kwargs to imshow~~
    • [x] ~~Option to disable titles~~
    • [x] ~~Add figure keyword to pass to existing MPL figure~~
    • [x] ~~Set default figsize to k * (per_row, rows) where k = np.max(plt.rcParams['figure.figsize']) / np.max(per_row, rows)~~
    • [x] ~~change default label_list to list of signal titles (metadata.General.title)~~
    • [x] ~~enable plotting when label_list is not the same length, printing useful output to user~~
    • [x] ~~implement single option label, which can be a str or a list~~
    • [x] ~~add support for RGB images~~
    • [x] ~~refactor signals to images~~
    • [x] ~~implement padding option to control spacing between images~~
    • [x] ~~Investigate switching to GridSpec for layout?~~
      • I don't think this is necessary, since using rect in plt.tight_layout solves the issue of overlapping colorbar
    • [x] ~~implement single option for colorbar (None, 'default', or 'single')~~
    • [x] ~~Fix colorbar placement~~
    • [x] ~~default cmap could be the one set generally with pyplot.set_cmap()~~
    • [x] ~~The label of the axis should be the same as image.plot(): "name (units)"~~
    • [x] ~~Should this function return a list of axes as plot_spectra?~~
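The default-figsize rule from the TODO list above can be worked through with a small sketch. `default_figsize` is a hypothetical helper, and `rc_figsize` stands in for plt.rcParams['figure.figsize']. The built-in `max` is used on purpose: `np.max(per_row, rows)` would treat `rows` as the axis argument rather than as a second value.

```python
def default_figsize(per_row, rows, rc_figsize=(6.4, 4.8)):
    """Scale a (per_row, rows) grid so its longest side matches rcParams.

    (6.4, 4.8) is Matplotlib's stock default figure size in inches.
    """
    k = max(rc_figsize) / max(per_row, rows)
    return (k * per_row, k * rows)


wide = default_figsize(per_row=4, rows=2)  # 4 images per row, 2 rows
tall = default_figsize(per_row=1, rows=3)  # a single column of 3 images
```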
    type: new feature 
    opened by jat255 64
  • JEOL eds data plugins

    Closes #2257.

    Description of the change

    • upload "jeol.py" in io_plugins to read several files made by the JEOL Analysis Station software (".ASW": summary file; ".IMG": image file; ".MAP": the first one is an image file similar to ".IMG", the others are elemental maps; ".PTS": eds data file; ".EDS": eds spectrum file).
    • add "jeol" reference in __init__.py

    Future improvements

    • datacube reconstruction is quite slow
    • "Memory Error" output when I try to reconstruct the whole datacube (512x512x4096) so only part of the data are read
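A quick size estimate shows why reconstructing the whole datacube can exhaust memory. The dtype used by the reader is not stated above, so the item sizes below are illustrative assumptions.

```python
shape = (512, 512, 4096)  # datacube size quoted above
n_values = shape[0] * shape[1] * shape[2]

# Memory needed for a few candidate dtypes (bytes per value are assumptions).
sizes_gib = {
    name: n_values * itemsize / 2**30
    for name, itemsize in {"uint8": 1, "uint16": 2, "uint32": 4}.items()
}
for name, gib in sizes_gib.items():
    print(f"{name}: {gib:.0f} GiB")
```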

    Apologies

    This is my first pull request and I am not familiar with GitHub.

    Thanks for your attention

    type: new feature release highlight 
    opened by arnduveg 61
  • IPywidgets UI

    In this PR

    • New IPywidgets User Interface
    • New UI for ROIs
    • traitsui and ipywidgets are now optional
    • New register mechanism to register user interface elements
    • ipywidgets are fully covered by unittests
    • Multiple UI issues fixed.
    • Creating an EELSModel with a spectrum without the required parameters no longer raises a GUI element
    • Deprecate AxesManager.show and Signal1D.integrate_in_range.
    • All functions that take a signal range can now take a roi.SpanRoi.
    • Fix the nbagg backend sluggishness.
    • Improve interactive plotting speed for all mpl backends.
    • All functions that take a signal range argument now accept a SpanRoi.
    • New signal slicing using ROIs
    • New hs.link_directional and hs.link_bidirectional functions to link traits, traitlets and traits to traitlets.
    • Split the GUI code into two separate packages: https://github.com/hyperspy/hyperspy_gui_traitsui and https://github.com/hyperspy/hyperspy_gui_ipywidgets

    To test it

    1. Install ipywidgets and enable them:
    jupyter nbextension enable --py --sys-prefix widgetsnbextension
    
    2. Install https://github.com/hyperspy/hyperspy_gui_traitsui and/or https://github.com/hyperspy/hyperspy_gui_ipywidgets and https://github.com/hyperspy/link_traits .

    3. Then start hyperspy as usual using any matplotlib backend except the inline backend. This PR makes the nbagg backend usable. To try it:

    
    %matplotlib nbagg
    import hyperspy.api as hs
    
    

    TODO

    Widgets

    • [x] Add ROI gui doc
    • [x] Fix busy nbagg again
    • [x] preferences
    • [x] Spikes removal tool
    • [x] Signal navigation sliders
    • [x] Axes manager gui
    • [x] Data axis gui
    • [x] Model signal range
    • Smoothers
      • [x] Savitzky Golay
      • [x] Lowess
      • [x] TV
    • [x] Background removal
    • [x] Fit component
    • [x] Crop
    • [x] Signal calibration: s.calibrate()
    • [x] SEM/TEM parameter: s.set_microscope_parameters()
    • Load: hs.load() [No equivalent widget -> not implementing]
    • [x] ImageContrastEditor: press h on an image, I actually never used this one...
    • Integrate signal: s.integrate_in_range() (enable use ROI instead)
    • [x] Add progress bar to spikesremovaltool
    • [x] set value slider step
    • [x] Add GUI to set_microscope_parameters
    • [x] Add non-continuous update
    • [x] Use gui instead of get_gui where possible
    • [x] Disable left/right traitsui
    • [x] Add ROIs widgets
    • [x] Add display to methods that call gui internally
    • [x] Fix EDS missing parameters message

    Deprecate:

    • [x] AxesManager.show
    • [x] Integrate in range
    • [x] Crop

    Other tasks

    • [x] Fix setting widget value out of axis bounds
    • [x] Fix remove_background error out of axis limits
    • [x] Add link and dlink to api
    • [x] Add transform to link
    • [x] All gui starts by gui
    • [x] Remove interactive as it is no longer necessary
    • [x] slice signal with SpanRoi and Rectangle2D
    • [x] Tests
      • [x] Axes
      • [x] micro parameters
      • [x] model
      • [x] tools
      • [x] preferences
      • [x] roi
    • [x] Fix existing non bidirectional widgets
    • [x] Fix traitsui color still breaking things
    • [x] Add toolkit option
    • [x] Add preferences option to disable individual toolkits
    • [x] Unlink on close
    • [x] Add register
    • [x] Split traitsui into a separate package
    • [x] Fix long labels
    • signal_range with ROIs
      • [x] Implementation
      • [x] Documentation
      • [x] Test
    • [x] Documentation
    • Manual testing
      • Preferences
        • [x] ipywidgets
        • [x] traitsui
      • AxesManager
        • [x] ipywidgets
        • [x] traitsui
      • Sliders
        • [x] ipywidgets
        • [x] traitsui
      • Smooth lowess
        • [x] ipywidgets
        • [x] traitsui
      • Smooth savitzky
        • [x] ipywidgets
        • [x] traitsui
      • Smooth TV
        • [x] ipywidgets
        • [x] traitsui
      • Background
        • [x] ipywidgets
        • [x] traitsui
      • Spikes removal tool
        • [x] ipywidgets
        • [x] traitsui
      • Calibrate
        • [x] ipywidgets
        • [x] traitsui
      • Image contrast
        • [x] ipywidgets
        • [x] traitsui
      • Model
        • [x] ipywidgets
      • Fit component
        • [x] ipywidgets
        • [x] traitsui
      • Parameter
        • [x] ipywidgets
      • Component
        • [x] ipywidgets
      • EELSCL
        • [x] ipywidgets
      • Scalable Fixed
        • [x] ipywidgets
      • Interactive range
        • [x] ipywidgets
        • [x] traitsui
      • Load
        • [x] traitsui
      • EELS parameters
        • [x] ipywidgets
        • [x] traitsui
      • EDS TEM parameter
        • [x] ipywidgets
        • [x] traitsui
      • EDS SEM parameter
        • [x] ipywidgets
        • [x] traitsui
      • Integrate in range
        • [x] traitsui

    (am I missing something else?)

    type: new feature 
    opened by francisco-dlp 50
  • Linear fitting

    Description of the change

    I've implemented unbounded linear fitting in hyperspy, by multivariate linear regression. This can fit components with free, linear parameters.

    Code coverage is nearly 100%. Closes #574, partially #488.

    Too long, didn't read:

    Run this and be wowed as to how fast it is:

    import hyperspy.api as hs
    import numpy as np
    nav = hs.signals.Signal2D(np.random.random((300, 300)))
    s = hs.datasets.example_signals.EDS_SEM_Spectrum() * nav.T
    m = s.create_model()
    
    m.multifit(optimizer='lstsq') # ~ 5 seconds, regular multifit would take 30 minutes
    

    Linear fitting

    A linear parameter is one that only stretches the component vertically, so it has no effect on the position or otherwise shape of the component. A single hyperspy component can contain several of these, which can be a bit confusing, so let me explain with some examples:

    Examples of linear and nonlinear parameters

    • A*x, the simplest case, scales the function f(x)=x by the linear parameter A.

    A hyperspy model contains a linear combination of components. Due to the Expression component, a component can contain what I call "pseudo-components" which are themselves linear combinations, like the following example:

    • A*x**2 + b*x + c.

    Here all parameters are considered linear. One could also have added these separately as three hyperspy components.

    • In A*sin(B*x-C), A is linear, B and C are not linear because they scale and shift the component horizontally
    • In A*exp(-b*x), A is linear, b is not linear

    HyperSpy automatically recognizes linear parameters using sympy. The properties Component.is_linear and Parameter._is_linear are automatically set for Expression components. m.linear_parameters and m.nonlinear_parameters help the user understand which parameters are considered linear.

    New optimizers

    • optimizer='lstsq': Uses either np.linalg.lstsq or dask.array.linalg.lstsq, depending if the model is lazy.
    • optimizer='ridge_regression': Uses sklearn.linear_model.Ridge, which supports regularization but does not support lazy signals.

    How to use

    The main difference between regular nonlinear fitting and linear fitting is that the components must all be made linear. m.fit(optimizer='lstsq') will inform the user of any free nonlinear parameters so that they can be set with parameter.free=False or m.set_parameters_not_free(only_nonlinear=True).

    Since linear fitting does not rely on the fit in the previous pixel, we can fit the entire dataset in one vectorised operation. This has major consequences for lazy loading - fitting a 25 GB dataset with 25 components takes 2 minutes (see lazy example).
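Why the whole dataset can be fitted in a single call can be seen from a plain NumPy sketch. This is not HyperSpy's implementation: the component shapes, sizes and variable names are made up, and the data is noise-free so the recovered heights can be compared directly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)

# Two fixed-shape components; only their heights (linear parameters) are free.
gauss = np.exp(-0.5 * ((x - 4.0) / 0.5) ** 2)
offset = np.ones_like(x)
design = np.column_stack([gauss, offset])   # (n_channels, n_components)

# A 3x4 navigation space with a different pair of heights in every pixel.
true_heights = rng.uniform(1.0, 5.0, size=(3, 4, 2))
data = true_heights @ design.T              # (3, 4, n_channels)

# One least-squares solve for all pixels: each column of `b` is one spectrum.
b = data.reshape(-1, x.size).T              # (n_channels, n_pixels)
coeffs, *_ = np.linalg.lstsq(design, b, rcond=None)
fitted_heights = coeffs.T.reshape(3, 4, 2)
```

The design matrix is shared by all pixels, which is also why this shortcut requires the nonlinear parameters to be identical across the navigation space.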

    If one requires the standard errors on parameters, then that must be specified for multifit calls because the np.linalg.inv operation required during error calculation may take significantly longer than the fitting on its own. It is specified as m.multifit(optimizer='lstsq', calculate_errors=True). I am open to suggestions for alternative ways of handling this.

    Finally, if the nonlinear parameters are set and vary across the navigation space (through para.map['values']), then multifit will default to the regular behaviour of iterating through the navigator, rather than fitting everything at once. This is because the simultaneous fitting assumes that all nonlinear parameters are the same across the navigator.

    Progress of the PR

    • [x] Change implemented (can be split into several points),
    • [x] update docstring (if appropriate),
    • [x] update user guide (if appropriate),
    • [x] add tests,
    • [x] ready for review.

    Minimal examples

    Regular fit

    Approximately as fast as optimizer='lm'

    from hyperspy.datasets.example_signals import EDS_SEM_Spectrum
    s = EDS_SEM_Spectrum()
    m = s.create_model()
    m.fit(optimizer='lstsq')
    

    Multifit

    About 200 times faster than optimizer='lm' for a small example. Replace the random shape below with (320, 320) and it is 750 times faster.

    import hyperspy.api as hs
    import numpy as np
    nav = hs.signals.Signal2D(np.random.random((32, 32)))
    s = hs.datasets.example_signals.EDS_SEM_Spectrum() * nav.T
    m = s.create_model()
    m.multifit(optimizer='lstsq')
    # We would use `calculate_errors=True` if errors were required, which takes a bit longer
    

    When is linear fitting useful?

    • When you only have free linear parameters in your model
    • When you want a (hopefully decent) starting point for further nonlinear fitting
    • When you have many components in your model
    • When you need to fit data much bigger than memory, but regular multifit takes too long

    Applications of linear fitting relevant to hyperspy

    • EDS fitting (fits the entire dataset, including background)
    • EELS fitting after background subtraction (Powerlaw could be fitted by taking log of data)
    • Fitting 2D atomic columns with known positions
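The remark about the power law can be made concrete: taking the logarithm of A*E**(-r) gives log(A) - r*log(E), which is linear in log(A) and r. A minimal NumPy sketch on synthetic, noise-free data with arbitrarily chosen values:

```python
import numpy as np

# Synthetic power-law background A * E**(-r); values chosen arbitrarily.
A_true, r_true = 1.0e6, 3.0
energy = np.linspace(200.0, 400.0, 50)
counts = A_true * energy ** (-r_true)

# log(counts) = log(A) - r * log(E): linear in log(A) and r.
design = np.column_stack([np.ones_like(energy), -np.log(energy)])
(log_A, r_fit), *_ = np.linalg.lstsq(design, np.log(counts), rcond=None)
A_fit = np.exp(log_A)
```

With real, noisy data the log transform changes how the noise is weighted, so the result may be better suited as a starting point than as a final fit.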

    More examples (convolution and large lazy example)

    Convolution EELS example

    This works great for convolved data as well

    import hyperspy.api as hs
    import numpy as np

    s = hs.datasets.artificial_data.get_core_loss_eels_signal()
    s.add_elements(['Fe'])
    ll = hs.datasets.artificial_data.get_low_loss_eels_signal()
    
    m = s.create_model(ll=ll, auto_background=False)
    m[0].onset_energy.value = 690.
    m.fit(optimizer='lstsq')
    s_linear = m.as_signal()
    
    m = s.create_model(ll=ll, auto_background=False)
    m[0].onset_energy.value = 690.
    m.fit(optimizer='lm')
    s_nonlinear = m.as_signal()
    
    np.testing.assert_array_almost_equal(s_linear.data, s_nonlinear.data)
    

    Lazy fitting

    This is a real strength of this PR: Fitting 3 million nav positions takes two minutes.

    import hyperspy.api as hs
    from hyperspy.axes import UniformDataAxis
    import dask.array as da
    
    from hyperspy.datasets.example_signals import EDS_SEM_Spectrum
    from hyperspy._signals.eds_sem import LazyEDSSEMSpectrum
    from hyperspy._signals.signal2d import LazySignal2D
    
    s = EDS_SEM_Spectrum()
    data = s.data
    axis = UniformDataAxis(offset = -0.1, scale = 0.01, size = 1024, units="keV")
    s2 = LazyEDSSEMSpectrum(data, axes = [axis])
    s2.add_elements(s.metadata.Sample.elements)
    s2.set_microscope_parameters(beam_energy=10.)
    
    nav = LazySignal2D(da.random.random((3000, 1000)))
    s = s2 * nav.T
    
    s.save("lazy.hspy", compression=None, overwrite=True, chunks = (30, 1000, 1024)) # This will take a few minutes
    
    # Then run these: 
    s = hs.load('lazy.hspy', lazy=True)
    m = s.create_model(auto_add_lines=False) # False temporarily due to #2806
    m.multifit(optimizer='lstsq')
    # Here we would use `calculate_errors=True` if errors were required, 
    # but that aspect isn't lazy, so we would use a lot of ram. 
    
    affects: documentation type: new feature affects: tests release highlight 
    opened by thomasaarholt 46
  • Proposal--HSEP: 4 Adding support for Labeled Columns/Advanced Slicing, Ragged Signal of Signals and Advanced Markers

    Hopefully this is the right thing to do. I wanted to write out my entire plan before I start making too much of a final push so that people can review it beforehand. This is a large change that touches many different parts of the code but should have fairly minimal changes to the overall working/syntax. It should also clean up/speed up some functionality that has been lacking.

    Describe the functionality you would like to see.

    I apologize for the large number of issues/Pull requests that I have created over the last couple of months. Admittedly there was some discovery in many parts of this, #3076 and #3075 as well as #3055 and #3031 are all relevant. This started from a desire to rework the diffraction spot finding in pyxem https://github.com/pyxem/pyxem/issues/872. Many of the features there are broken or unusable with large datasets. This is because of how the hs.Signal2d.find_peaks function is written as well as how the marker class handles plotting multiple different artists. Additionally, the lack of native support for column labeled signals becomes a large problem when trying to produce an end to end workflow for this type of analysis and maintain the high standards hyperspy has set for metadata, axes management and strict definition of data. (Not that this is a bad thing.)

    The desired workflow would be:

    1. Use interactive tools to determine location of important features for a single diffraction pattern
    2. Find important features in all of the images.
    3. Plot those features on the original dataset
    4. Iterate 1-3 until proper convergence/ fitting occurs.
    5. Refine and manipulate the important features, create figures, analyze columns etc.

    For large datasets, streamlining the iteration is very important, and lazy workflows are often extremely beneficial, as small parts of the data can be analyzed and observed without requiring the entire calculation to be repeated.

    In a more simplified context the features I would like to add are:

    1. Support for labeled columns in some Axes object. This includes the ability to slice signals using the axes label. This is very similar to how pandas or xarray allow for labeled column values.
    2. Defining markers as a ragged array with columns defining marker attributes and a variable number of rows defining the number of points.
    3. Support for ragged datasets with dtype= hs.BaseSignal
    4. Add an as_type parameter to the map function if the signal should be cast to a different signal type.

    Describe the context:

    Describing each of the features that I would like to add in more detail:

    1. Reworking the Axes class to support labeled columns.

    There is already a bunch of good discussion in #3055 as well as #3031 but to formally state my objectives.

    • Allow for adding labels to the BaseAxis.axis property so that signals can be sliced using non-numerical values. An example of this is s.isig["x"].

    • Allow for slicing using multiple values such as s.isig[["x", "y"]]. This requires that we are stricter about following NumPy's established advanced indexing rules. In short, this means that tuples and lists/arrays are treated differently. This also allows for slicing using one-dimensional boolean arrays (or lists).

    • Possible Inclusion: Allow slicing with a multidimensional boolean array: This is much more difficult to handle in regard to maintaining axes values. We can create a new axis but we would lose the axes information which might be of interest. I would suggest against allowing multidimensional boolean arrays in favor of maybe adding a generic boolean ROI.

    • Possible Inclusion: I also want to add a special kind of label which includes the offset/scale for some index. The idea is that both the real and pixel values for some point are saved. This is very useful when you want to use the pixel value to further analyze the pixelated image and also use the calibrated value.
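
    The tuple-versus-list distinction and label-based slicing described above can be sketched with plain NumPy. This is only an illustration under stated assumptions: `LabeledAxis` and its `resolve` method are hypothetical names, not part of HyperSpy, and the real implementation would live in the axes classes.

```python
import numpy as np


class LabeledAxis:
    """Hypothetical helper (not a HyperSpy class) mapping labels to indices."""

    def __init__(self, labels):
        self.labels = list(labels)
        self._index = {label: i for i, label in enumerate(self.labels)}

    def resolve(self, key):
        """Translate a label-based key into something NumPy can index with."""
        if isinstance(key, str):
            # A single label selects one column and drops the axis.
            return self._index[key]
        if isinstance(key, (list, np.ndarray)):
            key = np.asarray(key)
            if key.dtype == bool:
                # One-dimensional boolean masks are passed through unchanged.
                return key
            if key.dtype.kind in "US":
                # A list of labels is advanced indexing: the axis is kept.
                return np.array([self._index[str(k)] for k in key])
        return key  # integers and slices behave as usual


axis = LabeledAxis(["x", "y", "intensity"])
data = np.arange(12).reshape(4, 3)  # 4 vectors, 3 labeled columns

single = data[:, axis.resolve("x")]                  # shape (4,)
several = data[:, axis.resolve(["x", "y"])]          # shape (4, 2)
masked = data[:, axis.resolve([True, False, True])]  # shape (4, 2)

# NumPy already treats tuples and lists differently:
element = data[(0, 1)]  # one element (row 0, column 1)
rows = data[[0, 1]]     # rows 0 and 1, shape (2, 3)
```

    Keeping the label lookup separate from the indexing itself means the same rules NumPy users already know would apply once labels are resolved.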

    2. Defining markers as a ragged array

    This speeds up plotting of markers as it reduces the number of artists necessary to plot some marker. It also allows markers to be initialized from lazy or non-lazy signals without reshaping the data into a completely different form.

    • This will redefine markers as either a ragged array (if the markers change for each navigation position) or as a non ragged array of markers to be applied to a signal
    • If the marker.data attribute is ragged then at each n-dimensional index there will be an array with `dtype = [('x1', (float, size)), ('y1', (float, size)), ('x2', (float, size)), ('y2', (float, size)), ('text', (string, size)), ('size', (float, size))]`
    • The marker.data attribute can be a dask array. If the data attribute is lazy then the values are cached similar to how plotting lazy signals caches the data.
    • Markers can be initialized using existing methods/workflows but additionally have the option to be initialized in bulk.
    • Markers are sliced whenever signal.inav is used to slice the data. A copy of the marker is created and passed to the new signals.
    • Possible Inclusion: Originally I wanted the marker class to extend the BaseSignal class, and I am still not entirely convinced that isn't the best thing to do.
      • It simplifies the slicing, makes the navigation axes consistent and makes it easier to change/adjust markers in the plot.
      • Things like shifting markers when necessary are easier as the markers can be adjusted in place.
      • Using the addition in 1, the labels can be more easily identified/adjusted etc.
      • It majorly simplifies the workflow of finding some feature --> plotting the feature, as markers are just extensions of signals. It is also much easier to create lazy markers etc.
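
    As a rough illustration of the ragged layout described in this list, the sketch below builds an object array with one structured array per navigation position. The field names are a simplified, assumed subset of the dtype listed above, and nothing here is HyperSpy API; it only shows that NumPy can already hold this kind of data.

```python
import numpy as np

# Hypothetical field layout, loosely following the dtype listed above.
marker_dtype = np.dtype([("x1", float), ("y1", float), ("size", float)])

rng = np.random.default_rng(0)
nav_shape = (3, 4)

# One object-array element per navigation position, each a structured array
# whose length (the number of markers) is free to vary.
markers = np.empty(nav_shape, dtype=object)
for index in np.ndindex(nav_shape):
    n_points = int(rng.integers(1, 6))
    block = np.zeros(n_points, dtype=marker_dtype)
    block["x1"] = rng.uniform(0, 100, n_points)
    block["y1"] = rng.uniform(0, 100, n_points)
    block["size"] = 20.0
    markers[index] = block

# Slicing the navigation dimensions slices the markers with them.
subset = markers[0:2, 0:2]

# The (x, y) offsets for one navigation position, ready to hand to a plot.
offsets = np.column_stack([markers[1, 2]["x1"], markers[1, 2]["y1"]])
```

    Because the outer array is indexed by navigation position, slicing with the equivalent of signal.inav reduces to ordinary array slicing.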

    3. Support for ragged datasets with dtype= hs.BaseSignal

    Finally I propose support for ragged signals to allow for a dtype = BaseSignal and for that to be a special type of ragged signal.

    This allows for:

    • Initialization of ragged arrays with underlying information about their axes (i.e. Vectors associated with diffraction spots or interesting features)
    • Saving interesting features (i.e. some slice of an image) as signals with all the necessary data.

    Additional Aside: This is an extension of a couple of discussions that I have had with @ericpre over the last year or so. I would like for a ragged signal to return a non-ragged signal when the dtype is equal to some subclass of BaseSignal. This is the default numpy behavior and it allows a couple of very useful things.

    For example the following

    import numpy as np

    x = np.empty(2, dtype=object)
    
    x[0] = np.ones(10)
    x[1] = np.ones(15)
    x[1][5] = 2  # access and modify an element inside the ragged array
    

    It also allows for operating on each index individually in a non ragged fashion and plotting single indexes of some ragged signal using the syntax s.inav[2,2].plot().

    My proposal is to allow ragged signals to return non ragged signals if the s.data object has a dtype= BaseSignal. In this case it no longer makes sense to maintain the ragged array when instead a signal can be returned.

    Implementing this requires some fairly basic changes to the handling of ragged signals and some small changes regarding saving. I would like only one copy of the metadata/original metadata to be saved/loaded, to reduce the requirements for saving the dataset, and then passed on when the data is sliced. That should be fairly easy to manage. The map function will also have to be changed in order to just pass the data through without the Signal.

    Additional information

    Putting this all together (at least for my purposes) these changes should allow us to:

    
    s # Diffraction Signal 2d
    
    peaks = s.find_peaks(method="template_matching", template=disk_template, threshold=0.1, as_vector=True) 
    peaks # Ragged Signal, sub array of signal2D
    
    peaks.inav[2,2] # Signal 2D 
    
    peaks.inav[2,2].axes_manager[1].labels  # ["kx", "ky"]
    peaks.inav[2,2].isig["kx"] # Returns all of the "kx" vectors
    
    
    marker = peaks.to_marker()
    marker.navigation_shape == s.axes_manager.navigation_shape # True
    s.add_marker(marker, permanent=True)
    
    slic = s.inav[0:20,0:20]
    
    slic.markers[0].navigation_shape == slic.axes_manager.navigation_shape
    
    slic.plot() # plot a subset of the markers
    
    

    For a more generic workflow

    peaks = s.find_peaks(method="template_matching", template=disk_template, threshold=0.1, return_indexes=False)
    
    markers = hs.plot.markers.PointMarker(data=peaks.data) 
    
    s.add_marker(markers)
    s.plot()
    
    opened by CSSFrancis 0
  • Bump actions/checkout from 3.1.0 to 3.2.0

    Bumps actions/checkout from 3.1.0 to 3.2.0.

    Release notes

    Sourced from actions/checkout's releases.

    v3.2.0

    Full Changelog: https://github.com/actions/checkout/compare/v3...v3.2.0

    Changelog

    Sourced from actions/checkout's changelog.

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies 
    opened by dependabot[bot] 1
  • Vectorized Markers

    Description of the change

    Change described by #3075

    • Allows ragged point markers to be initialized.
    • Allows ragged line segment markers to be initialized.
    • Speeds up plotting many markers by ~100 times for plots with 100 or more active markers.
    • Allows for ragged initialization of points/markers.

    Progress of the PR

    • [x] Change implemented (can be split into several points),
    • [ ] update docstring (if appropriate),
    • [ ] update user guide (if appropriate),
    • [ ] add a changelog entry in the upcoming_changes folder (see upcoming_changes/README.rst),
    • [ ] Check formatting changelog entry in the readthedocs doc build of this PR (link in github checks)
    • [x] add tests,
    • [x] ready for review.

    Minimal example of the bug fix or the new feature

    import numpy as np
    import hyperspy.api as hs

    # One object-array element per navigation position (10 x 10)
    x = np.empty((10, 10), dtype=object)
    y = np.empty((10, 10), dtype=object)

    # 500 random (x, y) points for every navigation position
    points = np.random.randint(0, 100, (2, 10, 10, 500))

    for i in np.ndindex(10, 10):
        x[i] = points[(0,) + i]
        y[i] = points[(1,) + i]

    s = hs.signals.Signal2D(np.random.random((10, 10, 100, 100)))
    # Only adds one ragged marker object to the dataset
    markers = hs.plot.markers.Point(x=x, y=y, color="blue")
    s.add_marker(markers, plot_marker=False, permanent=True, render_figure=False)
    s.plot(vmin=0, vmax=1)
    
    opened by CSSFrancis 3
  • Plotting many Markers is slow

    Describe the bug

    I've found that plotting a lot of markers can be very slow. The problem seems to be a bug where a linear increase in the number of artists results in an exponential increase in the plotting time for some signal. The result is a sluggish interface. This is part of the reason that things like the find peaks interface can appear sluggish, especially when there are many different peaks identified.

    I discussed this a little bit in #3031

    For example when I run the code:

    num_points =  400
    points = np.random.randint(0, 100, (num_points, 2,10,10))
    
    %%timeit
    s = hs.signals.Signal2D(np.random.random((10,10,100,100)))
    markers = [ hs.plot.markers.Point(x = p[0],y= p[1],color="blue",) for p in points]  
    for m in markers:
        s.add_marker(m,plot_marker=False, permanent=True, render_figure=False)
    s.plot(vmin=0, vmax=1)
    

    It takes around 0.2 ms to plot with 10 markers, and most of this time is related to setting up the signal. For 200 markers that time increases to 1.86 seconds, most of which is spent in the plot function, more specifically the underlying _plot_signal function.

    The issue is related to the repeated calls to the plt.scatter function. The issue is similar when adding many patches to the plot, as is the case with adding arrows. For the most part, though, it is less likely that someone is adding hundreds of arrows to a plot than hundreds of points or lines.
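
    To make the artist-count argument concrete, here is a minimal Matplotlib-only sketch (no HyperSpy involved, and no timings claimed) that contrasts one scatter call per point with a single scatter call for all points. The point count of 200 is simply taken from the example above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
xy = rng.uniform(0, 100, size=(200, 2))

# One scatter call per point: 200 separate PathCollection artists.
fig_many, ax_many = plt.subplots()
for x, y in xy:
    ax_many.scatter(x, y, color="blue")

# One scatter call for all points: a single PathCollection artist.
fig_one, ax_one = plt.subplots()
collection = ax_one.scatter(xy[:, 0], xy[:, 1], color="blue")

n_many = len(ax_many.collections)
n_one = len(ax_one.collections)

# Moving every point later only touches that one artist.
collection.set_offsets(xy + 1.0)
```

    Every artist has to be drawn and updated separately, so keeping all points of a navigation position in one collection keeps the per-redraw work in a single object.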

    Expected behavior

    Markers should behave in a way similar to ragged signals in hyperspy. Not only does this streamline the progression from finding a feature --> creating a marker --> visualization, but it greatly speeds up visualization by reducing the number of markers.

    It also removes the requirement that each navigation position has the same number of markers, a requirement that, as we see above, is an extreme detriment to plotting speed.

    For the point marker the change in the code is relatively small and non-breaking.

    If we change this line of code:

    https://github.com/hyperspy/hyperspy/blob/7f6b448b91da35b71774a52860c95ba69deeb41c/hyperspy/drawing/_markers/point.py#L89-L90

    To:

    self.marker.set_offsets(np.squeeze(np.transpose([self.get_data_position('x1'),self.get_data_position('y1')])))
    

    The plotting time doesn't change with the number of markers and we can also use this syntax to plot markers.

    x= np.empty((10,10), dtype=object)
    y= np.empty((10,10), dtype=object)
    
    points = np.random.randint(0, 100, (2,10,10, 500))
    
    for i in np.ndindex(10,10):
        x_ind = (0,)+i
        y_ind = (1,)+i
        x[i]=points[x_ind]
        y[i]=points[y_ind]
    
    
    s = hs.signals.Signal2D(np.random.random((10,10,100,100)))
    markers =hs.plot.markers.Point(x =x,y=y,color="blue",)   # only adds one ragged marker object to the dataset. 
    s.add_marker(markers,plot_marker=False, permanent=True, render_figure=False)
    s.plot(vmin=0, vmax=1)
    

    Additional context

    This does cause a problem with saving markers similar to #2904 but we could solve the problem in much the same way.

    If we want the same increase in speed when plotting lines (@hakonanes you are probably very interested in this) the equivalent idea would be to replace the set of Line2D objects with a PolyCollection.

    I did some testing of that and the speedups are very comparable depending on the number of line segments.
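
    A hedged sketch of the same idea for lines follows. Note that it uses Matplotlib's LineCollection rather than the PolyCollection named above, since LineCollection is the collection class that takes a list of segments directly; whether that is the right choice for HyperSpy's line markers is an assumption, not something tested here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.collections import LineCollection

rng = np.random.default_rng(0)
# 300 segments, each given by two (x, y) end points.
segments = rng.uniform(0, 100, size=(300, 2, 2))

fig, ax = plt.subplots()
lines = LineCollection(segments, colors="red", linewidths=1.0)
ax.add_collection(lines)
ax.set_xlim(0, 100)
ax.set_ylim(0, 100)

# Updating all segments for a new navigation position touches one artist.
lines.set_segments(segments + 1.0)
n_artists = len(ax.collections)
```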

    I can make a PR with this change. I think this will really clean up some peakfinding/fitting workflows and make them seem much faster without much underlying change to the code.

    type: bug 
    opened by CSSFrancis 0
  • Bugfix: Do not raise AttributeError when loading markers

    Description of the change

    • Do not raise AttributeError when loading markers (deleting uninitialized marker dictionary)

    Progress of the PR

    • [x] Change implemented (can be split into several points),
    • [ ] update docstring (if appropriate),
    • [ ] update user guide (if appropriate),
    • [ ] add a changelog entry in the upcoming_changes folder (see upcoming_changes/README.rst),
    • [ ] Check formatting changelog entry in the readthedocs doc build of this PR (link in github checks)
    • [ ] add tests,
    • [ ] ready for review.

    Minimal example of the bug fix or the new feature

    need Markers from rosettasciio

    opened by nem1234 3
Releases(v1.7.3)