Apache (Py)Spark type annotations (stub files).

Last update: Nov 22, 2022

Overview

PySpark Stubs

A collection of the Apache Spark stub files. These files were generated by stubgen and manually edited to include accurate type hints.

Tests and configuration files were originally contributed to the Typeshed project. Please refer to its contributors list and license for details.

Important

This project has been merged with the main Apache Spark repository (SPARK-32714). All further development for Spark 3.1 and onwards will be continued there.

For Spark 2.4 and 3.0, development of this package will be continued, until their official deprecation.

If your problem is specific to Spark 2.3 and 3.0, feel free to create an issue or open a pull request here.
Otherwise, please check the official Spark JIRA and contributing guidelines. If you create a JIRA ticket or Spark PR related to type hints, please ping me with [~zero323] or @zero323 respectively. Thanks in advance.

Motivation

Static error detection (see SPARK-20631)
Improved autocompletion.

Installation and usage

Please note that the guidelines for distribution of type information are still a work in progress (PEP 561 - Distributing and Packaging Type Information). Currently, the installation script overlays existing Spark installations (pyi stub files are copied next to their py counterparts in the PySpark installation directory). If this approach is not acceptable, you can add stub files to the search path manually.

According to PEP 484:

Third-party stub packages can use any location for stub storage. Type checkers should search for them using PYTHONPATH.

Moreover:

A default fallback directory that is always checked is shared/typehints/python3.5/ (or 3.6, etc.)

Please check usage before proceeding.

The package is available on PyPI:

pip install pyspark-stubs

and conda-forge:

conda install -c conda-forge pyspark-stubs

Depending on your environment, you might also need a type checker, like Mypy or Pytype [1], and an autocompletion tool, like Jedi.

Editor	Type checking	Autocompletion	Notes
Atom	✔ [2]	✔ [3]	Through plugins.
IPython / Jupyter Notebook	✘ [4]	✔
PyCharm	✔	✔
PyDev	✔ [5]	?
VIM / Neovim	✔ [6]	✔ [7]	Through plugins.
Visual Studio Code	✔ [8]	✔ [9]	Completion with plugin
Environment independent / other editors	✔ [10]	✔ [11]	Through Mypy and Jedi.

This package is tested against the Mypy development branch and, in rare cases (primarily important upstream bugfixes), is not compatible with the preceding Mypy release.

PySpark Version Compatibility

Package versions follow PySpark versions, with the exception of maintenance releases, i.e. pyspark-stubs==2.3.0 should be compatible with pyspark>=2.3.0,<2.4.0. Maintenance releases (post1, post2, ..., postN) are reserved for internal annotation updates.
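
The versioning rule can be sketched in a few lines of Python. This is only an illustration: the helper name compatible_pyspark_range is hypothetical and not part of this package, and it assumes versions of the form MAJOR.MINOR.PATCH with an optional suffix such as .postN.

```python
def compatible_pyspark_range(stubs_version: str) -> str:
    """Return a pip-style specifier for the PySpark releases a stubs version targets.

    Hypothetical helper: assumes the version looks like MAJOR.MINOR.PATCH[.suffix].
    """
    parts = stubs_version.split(".")
    major, minor = int(parts[0]), int(parts[1])
    # Suffixes such as post1 only mark annotation updates, so they are ignored.
    return f"pyspark>={major}.{minor}.0,<{major}.{minor + 1}.0"


print(compatible_pyspark_range("2.3.0"))        # pyspark>=2.3.0,<2.4.0
print(compatible_pyspark_range("2.4.0.post9"))  # pyspark>=2.4.0,<2.5.0
```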

API Coverage:

As of release 2.4.0 most of the public API is covered. For details please check API coverage document.

Disclaimer

Apache Spark, Spark, PySpark, Apache, and the Spark logo are trademarks of The Apache Software Foundation. This project is not owned, endorsed, or sponsored by The Apache Software Foundation.

Footnotes

[1]	Not supported or tested.

[2]	Requires atom-mypy or equivalent.

[3]	Requires autocomplete-python-jedi or equivalent.

[4]	It is possible to use magics to type check directly in the notebook. In general though, you'll have to export the whole notebook to a .py file and run a type checker on the result.

[5]	Requires PyDev 7.0.3 or later.

[6]	TODO Using vim-mypy, syntastic or Neomake.

[7]	With jedi-vim.

[8]	With Mypy linter.

[9]	With Python extension for Visual Studio Code.

[10]	Just use your favorite checker directly, optionally combined with a tool like entr.

[11]	See Jedi editor plugins list.

Comments

Fix 2-argument math functions
Fixes the binary math functions:

atan2 and hypot take two arguments, not one

pow supports taking a literal numeric value as its second argument in addition to a Column.

bug 3.0 2.3 2.4
opened by harpaj 10

Jedi doesn't work with MLReaders

It seems like there is some problem with Jedi compatibility. Some components seem to work pretty well. For example DataFrame without stubs:

In [1]: import jedi

In [2]: from pyspark.sql import SparkSession

In [3]: jedi.Interpreter("SparkSession.builder.getOrCreate().createDataFrame([]).", [globals()]).completions()
---------------------------------------------------------------------------
AttributeError
...
AttributeError: 'ModuleContext' object has no attribute 'py__path__'

and with stubs:

In [1]: from pyspark.sql import SparkSession

In [2]: import jedi

In [3]: jedi.Interpreter("SparkSession.builder.getOrCreate().createDataFrame([]).", [globals()]).completions()
Out[3]: 
[<Completion: agg>,
 <Completion: alias>,
 <Completion: approxQuantile>,
 <Completion: cache>,
 <Completion: checkpoint>,
 <Completion: coalesce>,
 <Completion: collect>,
 <Completion: colRegex>,
 <Completion: columns>,
 <Completion: corr>,
 <Completion: count>,
 <Completion: cov>,
...
 <Completion: __str__>]

So far so good. However, if we take, for example, LinearRegressionModel.load, things don't work so well. Without stubs, Jedi provides no suggestions:

In [1]: import jedi

In [2]: from pyspark.ml.regression import LinearRegressionModel

In [3]: jedi.Interpreter("LinearRegressionModel.load('foo').", [globals()]).completions()
Out[3]: []

but the suggestions provided with stubs

In [1]: import jedi

In [2]: from pyspark.ml.regression import LinearRegressionModel

In [3]: jedi.Interpreter("LinearRegressionModel.load('foo').", [globals()]).completions()
Out[3]: 
[<Completion: load>,
 <Completion: read>,
 <Completion: __annotations__>,
 <Completion: __class__>,
 <Completion: __delattr__>,
 <Completion: __dict__>,
 <Completion: __dir__>,
 <Completion: __doc__>,
 <Completion: __eq__>,
 <Completion: __format__>,
 <Completion: __getattribute__>,
 <Completion: __hash__>,
 <Completion: __init__>,
 <Completion: __init_subclass__>,
 <Completion: __module__>,
 <Completion: __ne__>,
 <Completion: __new__>,
 <Completion: __reduce__>,
 <Completion: __reduce_ex__>,
 <Completion: __repr__>,
 <Completion: __setattr__>,
 <Completion: __sizeof__>,
 <Completion: __slots__>,

don't make much sense. If the model is fitted:

In [4]: from pyspark.ml.regression import LinearRegression

In [5]: jedi.Interpreter("LinearRegression().fit(...).", [globals()]).completions()
Out[5]: 
[<Completion: aggregationDepth>,
 <Completion: append>,
 <Completion: clear>,
 <Completion: coefficients>,
 <Completion: copy>,
 <Completion: count>,
....
 <Completion: __str__>]

Model which is explicitly annotated works fine, so it seems like there is something in MLReader or one of the sub-classes that causes a failure.

We already have data tests for this (as well as some test cases from apache/spark examples), and mypy seems to be fine with this.

Since LinearRegression.fit works fine (and some toy tests confirm that), generics alone are not sufficient to reproduce the problem. So it seems like the type parameter is not processed correctly on the path:

Tested with:

jedi==0.15.2 and jedi==0.16.0 (0c56aa4).
pyspark-stubs==3.0.0.dev5
pyspark==3.0.0.dev0 (afe70b3)

opened by zero323 7

DataFrameReader.load parameters incorrectly expected all to be strings
Using 2.4.0.post6

spark.read.load(folders, inferSchema=True, header=False)

mypy reports Expected type 'str', got 'bool' instead for both inferSchema and header.

Looks like the issue is in third_party/3/pyspark/sql/readwriter.pyi Line 23 where in the definition for load() we have **options: str. For csv support this needs to be **options: Optional[Union[bool, str, int]], but to handle the general case it probably needs to be **options: Any.
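
To make the difference concrete, here is a minimal, self-contained sketch. The functions load_strict and load_relaxed and the alias OptionValue are made up for illustration; they only imitate the shape of the annotations discussed in this issue and are not the actual stub definitions.

```python
from typing import Any, Dict, Optional, Union

# Assumption: an alias like this is one way to spell the narrower option type
# suggested in the issue; the name is made up for this sketch.
OptionValue = Optional[Union[bool, str, int]]


def load_strict(path: str, **options: str) -> Dict[str, Any]:
    """Imitates the reported annotation: every option must be a str."""
    return dict(options)


def load_relaxed(path: str, **options: OptionValue) -> Dict[str, Any]:
    """Imitates the proposed annotation: bool, str, int or None are accepted."""
    return dict(options)


# A type checker flags this call, because inferSchema and header are bools:
#   load_strict("data", inferSchema=True, header=False)
# The same call passes against the relaxed signature:
opts = load_relaxed("data", inferSchema=True, header=False)
print(opts)  # {'inferSchema': True, 'header': False}
```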
enhancement
opened by ghost 7

Added contains to Column

The contains method is missing from the stubs causing mypy to raise error: "Column" not callable.

This PR adds the type hints to 2.4 specifically (the version we are using), but they should probably also be added to the other versions.

opened by Braamling 6

#394: Use Union[List[Column], List[str]] for Select

Passing a List[str] to select raises a mypy warning, similar for List[Column]. We change the type from List[Union[Column, str]] to Union[List[Column], List[str]].
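
The reason for the warning is that List is invariant in its element type. A small sketch, using a placeholder Column class and two hypothetical functions (select_before and select_after) that stand in for the old and new annotation:

```python
from typing import List, Union


class Column:
    """Placeholder standing in for pyspark.sql.Column in this sketch."""


def select_before(cols: List[Union[Column, str]]) -> int:
    # Old-style annotation: a single list whose elements may be mixed.
    return len(cols)


def select_after(cols: Union[List[Column], List[str]]) -> int:
    # New-style annotation: either a list of Columns or a list of strings.
    return len(cols)


names: List[str] = ["id", "value"]

# List is invariant, so a type checker rejects a List[str] variable where
# List[Union[Column, str]] is expected, even though every element fits:
#   select_before(names)  # flagged by mypy
# With a union of list types, the same argument is accepted:
print(select_after(names))  # 2
```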

Fixes #394.

opened by jhereth 5

Update distinct() and repartition() definitions

Update repartition functions to allow for a Column in the numPartitions parameter.

Reference

numPartitions – can be an int to specify the target number of partitions or a Column.
    If it is a Column, it will be used as the first partitioning column.
    If not specified, the default number of partitions is used.

Also add stub for DataFrame#distinct()

opened by zpencerq 5

Allow `Column` type for timezone argument in pyspark.sql.functions
In the functions here: https://github.com/zero323/pyspark-stubs/blob/3c4684a224c1be4eea4577e475f8bb4d045edddd/third_party/3/pyspark/sql/functions.pyi#L100-L101 we currently have tz: str but this can also be specified as a Column

Example:

>>> from pyspark.sql import functions
>>> df = spark.sql("SELECT CAST(0 AS TIMESTAMP) AS timestamp, 'Asia/Tokyo' AS tz")
>>> df.select(functions.from_utc_timestamp(df.timestamp, df.tz)).collect()
[Row(from_utc_timestamp(timestamp, tz)=datetime.datetime(1970, 1, 1, 18, 0))]

I think this could be expanded to tz: ColumnOrName?
3.0 2.4 3.1
opened by charlietsai 4

Overload DataFrame.drop: sequences must be *str

The method DataFrame.drop expects either 1 Column, or 1 str, or an iterable of strings. This is only type checked inside the function though.

Currently the type hints (and the actual API) allow passing multiple Columns, but doing so results in a runtime error. Personally, I'd like to have that caught earlier. But as this might be getting too close to the internals of the functions, I’d like to hear your opinion on whether or not the type hints should “look inside” to aid development.
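
One way to express this with overloads is sketched below. ToyFrame and Column are placeholders invented for this example, and the runtime check only imitates the behaviour described in this PR; it is not PySpark's implementation.

```python
from typing import Tuple, Union, overload


class Column:
    """Placeholder standing in for pyspark.sql.Column in this sketch."""


class ToyFrame:
    """Toy class; only the shape of the drop() signature matters here."""

    # Positional-only (Python 3.8+) so the *cols implementation matches both overloads.
    @overload
    def drop(self, cols: Union[Column, str], /) -> Tuple[object, ...]: ...
    @overload
    def drop(self, *cols: str) -> Tuple[object, ...]: ...

    def drop(self, *cols):
        # Several arguments are only allowed when all of them are strings.
        if len(cols) > 1 and not all(isinstance(c, str) for c in cols):
            raise TypeError("multiple columns must all be given as strings")
        return cols


frame = ToyFrame()
print(frame.drop("a", "b"))  # ('a', 'b')
# frame.drop(Column(), Column()) matches neither overload, so a type checker
# reports it before the TypeError is ever raised at runtime.
```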

opened by oliverw1 4

provide overloaded methods for sample

The fraction is a required argument to the sample method. Any time someone calls df.sample(.01), mypy reports:

Argument 1 to "sample" of "DataFrame" has incompatible type "float"; expected "Optional[bool]"

In the PySpark API, the three arguments are in fact pure keyword arguments that are handled later to ensure that fraction is given. This is probably done to stay consistent with the Scala API.

By overloading the methods, the issue is resolved.
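
A rough sketch of the idea follows. ToyFrame is a made-up class, and its argument shuffling is a simplified imitation of the behaviour described in this PR, not PySpark's code. The two overloads tell a type checker that the first positional argument may be either the fraction or withReplacement.

```python
from typing import Optional, Tuple, overload

Result = Tuple[Optional[bool], float, Optional[int]]


class ToyFrame:
    """Toy class; returns the resolved arguments instead of sampling data."""

    @overload
    def sample(self, fraction: float, seed: Optional[int] = ...) -> Result: ...
    @overload
    def sample(
        self, withReplacement: Optional[bool], fraction: float, seed: Optional[int] = ...
    ) -> Result: ...

    def sample(self, withReplacement=None, fraction=None, seed=None):
        # sample(fraction[, seed]) arrives with the fraction in the first slot.
        if isinstance(withReplacement, float):
            withReplacement, fraction, seed = None, withReplacement, fraction
        if fraction is None:
            raise TypeError("fraction is required")
        return (withReplacement, fraction, seed)


frame = ToyFrame()
print(frame.sample(0.01))            # (None, 0.01, None)
print(frame.sample(False, 0.5, 42))  # (False, 0.5, 42)
```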

opened by oliverw1 4

Allow non-string load/save parameters

Resolves #273

Additional parameters to DataFrameReader.load() and DataFrameWriter.save()/.saveAsTable() are passed to the file-type specific reader or writer. These parameters can be of any type.

opened by mark-oppenheim 4

Fix return type for DataFrame.groupBy / cube / rollup

2.3 has these data types and I was erroneously getting errors for them.

Note this is a port of e2d225f06ff36fcbf79e2123f1c18f380e862728

I tried a cherry-pick but it had some issues (not sure why)

opened by dangercrow 4

Releases(3.0.0.post3)

3.0.0.post3 (Jan 7, 2022)
3.0.0.post2 (Jan 7, 2022)
3.0.0.post1 (Sep 15, 2020)
2.4.0.post9 (Sep 14, 2020)
3.0.0 (Jul 18, 2020)
3.0.0.dev8 (Apr 28, 2020)
2.4.0.post8 (Apr 28, 2020)
3.0.0.dev7 (Apr 28, 2020)
3.0.0.dev6 (Feb 3, 2020)
3.0.0.dev5 (Feb 3, 2020)
3.0.0.dev4 (Feb 3, 2020)
3.0.0.dev3 (Feb 3, 2020)
3.0.0.dev1 (Feb 3, 2020)
3.0.0.dev0 (Feb 3, 2020)
2.4.0.pre4 (Feb 3, 2020)
2.4.0.pre3 (Feb 3, 2020)
2.4.0.pre2 (Feb 3, 2020)
2.4.0.pre1 (Feb 3, 2020)

Owner

Maciej

Just a dog on the Internet. I would love to tell you more, but then, of course, I'd have to erase your memory. A30CEF0C31A501EC

