Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning

Overview


Here is deepparse.

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning.

Use deepparse to:

  • use the pre-trained models to parse multinational addresses,
  • retrain our pre-trained models on new data to parse multinational addresses,
  • retrain our pre-trained models with your own prediction tags easily,
  • retrain a new seq2seq address parsing model easily.

Read the documentation at deepparse.org.

Deepparse is compatible with the latest version of PyTorch and Python >= 3.7.

Countries and Results

We evaluate our models on two forms of address data:

  • clean data, which refers to addresses containing elements from four categories, namely a street name, a municipality, a province and a postal code,
  • incomplete data, which is made up of addresses missing at least one of those four categories.
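The distinction between the two forms can be sketched in a few lines. This is only an illustration: it assumes the four categories map onto the tag names `StreetName`, `Municipality`, `Province` and `PostalCode` listed under Getting Started, and the helper names are hypothetical, not part of deepparse.

```python
# Assumption: the four categories correspond to these prediction tag names.
REQUIRED_CATEGORIES = frozenset({"StreetName", "Municipality", "Province", "PostalCode"})


def missing_categories(tags):
    """List the required categories that do not appear in `tags`, sorted by name."""
    return sorted(REQUIRED_CATEGORIES - set(tags))


def classify_address(tags):
    """Return "clean" when every required category is present, else "incomplete"."""
    return "clean" if not missing_categories(tags) else "incomplete"


full = ["StreetNumber", "StreetName", "Municipality", "Province", "PostalCode"]
partial = ["StreetNumber", "StreetName", "Municipality"]

print(classify_address(full))       # clean
print(classify_address(partial))    # incomplete
print(missing_categories(partial))  # ['PostalCode', 'Province']
```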

You can get our dataset here.

Clean Data

The following table presents the accuracy (using clean data) on the 20 countries we used during training for both our models.

Country          Fasttext (%)   BPEmb (%)   Country       Fasttext (%)   BPEmb (%)
Norway           99.06          98.3        Austria       99.21          97.82
Italy            99.65          98.93       Mexico        99.49          98.9
United Kingdom   99.58          97.62       Switzerland   98.9           98.38
Germany          99.72          99.4        Denmark       99.71          99.55
France           99.6           98.18       Brazil        99.31          97.69
Netherlands      99.47          99.54       Australia     99.68          98.44
Poland           99.64          99.52       Czechia       99.48          99.03
United States    99.56          97.69       Canada        99.76          99.03
South Korea      99.97          99.99       Russia        98.9           96.97
Spain            99.73          99.4        Finland       99.77          99.76

We have also made a zero-shot evaluation of our models using clean data from 41 other countries; the results are shown in the next table.

Country        Fasttext (%)   BPEmb (%)   Country         Fasttext (%)   BPEmb (%)
Latvia         89.29          68.31       Faroe Islands   71.22          64.74
Colombia       85.96          68.09       Singapore       86.03          67.19
Réunion        84.3           78.65       Indonesia       62.38          63.04
Japan          36.26          34.97       Portugal        93.09          72.01
Algeria        86.32          70.59       Belgium         93.14          86.06
Malaysia       83.14          89.64       Ukraine         93.34          89.42
Estonia        87.62          70.08       Bangladesh      72.28          65.63
Slovenia       89.01          83.96       Hungary         51.52          37.87
Bermuda        83.19          59.16       Romania         90.04          82.9
Philippines    63.91          57.36       Belarus         93.25          78.59
Bosnia         88.54          67.46       Moldova         89.22          57.48
Lithuania      93.28          69.97       Paraguay        96.02          87.07
Croatia        95.8           81.76       Argentina       81.68          71.2
Ireland        80.16          54.44       Kazakhstan      89.04          76.13
Greece         87.08          38.95       Bulgaria        91.16          65.76
Serbia         92.87          76.79       New Caledonia   94.45          94.46
Sweden         73.13          86.85       Venezuela       79.23          70.88
New Zealand    91.25          75.57       Iceland         83.7           77.09
India          70.3           63.68       Uzbekistan      85.85          70.1
Cyprus         89.64          89.47       Slovakia        78.34          68.96
South Africa   95.68          74.82

Incomplete Data

The following table presents the accuracy of both our models on incomplete data for the 20 countries we used during training. We did not test on the other 41 countries: since we did not train on them, we do not expect meaningful results.

Country          Fasttext (%)   BPEmb (%)   Country       Fasttext (%)   BPEmb (%)
Norway           99.52          99.75       Austria       99.55          98.94
Italy            99.16          98.88       Mexico        97.24          95.93
United Kingdom   97.85          95.2        Switzerland   99.2           99.47
Germany          99.41          99.38       Denmark       97.86          97.9
France           99.51          98.49       Brazil        98.96          97.12
Netherlands      98.74          99.46       Australia     99.34          98.7
Poland           99.43          99.41       Czechia       98.78          98.88
United States    98.49          96.5        Canada        98.96          96.98
South Korea      91.1           99.89       Russia        97.18          96.01
Spain            99.07          98.35       Finland       99.04          99.52

Getting Started

from deepparse.parser import AddressParser

address_parser = AddressParser(model_type="bpemb", device=0)

# you can parse one address
parsed_address = address_parser("350 rue des Lilas Ouest Québec Québec G1L 1B6")

# or multiple addresses
parsed_address = address_parser(
    ["350 rue des Lilas Ouest Québec Québec G1L 1B6", "350 rue des Lilas Ouest Québec Québec G1L 1B6"])

# or multinational addresses
# Canada, US, Germany, UK and South Korea
parsed_address = address_parser(
    ["350 rue des Lilas Ouest Québec Québec G1L 1B6", "777 Brockton Avenue, Abington MA 2351",
     "Ansgarstr. 4, Wallenhorst, 49134", "221 B Baker Street", "서울특별시 종로구 사직로3길 23"])

# you can also get the probability of the predicted tags
parsed_address = address_parser("350 rue des Lilas Ouest Québec Québec G1L 1B6", with_prob=True)

The prediction tags are the following:

  • "StreetNumber": for the street number,
  • "StreetName": for the name of the street,
  • "Unit": for the unit (such as apartment),
  • "Municipality": for the municipality,
  • "Province": for the province or local region,
  • "PostalCode": for the postal code,
  • "Orientation": for the street orientation (e.g. west, east),
  • "GeneralDelivery": for other delivery information.
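To make the tags concrete, here is a minimal sketch that collects the words of an address under each tag. It assumes the parsed result has already been turned into a list of (text, tag) pairs; that input shape, the tagging of the sample address and the `group_by_tag` helper are illustrative and not part of the deepparse API.

```python
from collections import defaultdict


def group_by_tag(pairs):
    """Join the words of each tag into one string, skipping empty (None) entries."""
    grouped = defaultdict(list)
    for text, tag in pairs:
        if text is not None:
            grouped[tag].append(text)
    return {tag: " ".join(words) for tag, words in grouped.items()}


# Hypothetical word-level tagging of the address used in the examples above.
pairs = [
    ("350", "StreetNumber"),
    ("rue", "StreetName"),
    ("des", "StreetName"),
    ("Lilas", "StreetName"),
    ("Ouest", "Orientation"),
    ("Québec", "Municipality"),
    ("Québec", "Province"),
    ("G1L", "PostalCode"),
    ("1B6", "PostalCode"),
    (None, "Unit"),
]

grouped = group_by_tag(pairs)
print(grouped["StreetName"])  # rue des Lilas
print(grouped["PostalCode"])  # G1L 1B6
print("Unit" in grouped)      # False
```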

Retrain a Model

See here for a complete example.

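# We will retrain the fasttext version of our pretrained model.
address_parser = AddressParser(model_type="fasttext", device=0)

# training_container holds the training data; 0.8 is the train ratio,
# i.e. the share of the data used for training (the rest is used for validation).
address_parser.retrain(training_container, 0.8, epochs=5, batch_size=8)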
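The snippet above takes `training_container` as given. As a rough sketch of the data behind it, the example below writes a pickle file holding a list of (address, tags) tuples with one tag per whitespace-separated word, which is the format users describe in the issue reports further down. The sample tagging, the file name and the helper names are assumptions made for illustration; loading the file into a container is left to the library.

```python
import os
import pickle
import tempfile

# Hypothetical examples: each tuple is (address text, one tag per word).
training_data = [
    (
        "350 rue des Lilas Ouest Québec Québec G1L 1B6",
        ["StreetNumber", "StreetName", "StreetName", "StreetName", "Orientation",
         "Municipality", "Province", "PostalCode", "PostalCode"],
    ),
    (
        "777 Brockton Avenue Abington MA 2351",
        ["StreetNumber", "StreetName", "StreetName", "Municipality", "Province",
         "PostalCode"],
    ),
]


def save_training_data(examples, path):
    """Write the list of (address, tags) tuples to a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(examples, handle)


def load_training_data(path):
    """Read the list of (address, tags) tuples back from a pickle file."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


path = os.path.join(tempfile.mkdtemp(), "train.p")
save_training_data(training_data, path)
restored = load_training_data(path)
print(restored == training_data)  # True
```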

Retrain a Model With New Tags

See here for a complete example.

address_components = {"ATag": 0, "AnotherTag": 1, "EOS": 2}
address_parser.retrain(training_container, 0.8, epochs=1, batch_size=128, prediction_tags=address_components)
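A small helper can build such a dictionary from a list of tag names. This sketch assumes, based on the example above, that indices run from 0 upward and that the end-of-sequence tag `"EOS"` takes the last index; `build_prediction_tags` is a hypothetical name, not a deepparse function.

```python
def build_prediction_tags(tag_names, eos_tag="EOS"):
    """Map each tag name to a consecutive index and put the EOS tag last."""
    if eos_tag in tag_names:
        raise ValueError(f"{eos_tag!r} is added automatically; remove it from tag_names")
    if len(set(tag_names)) != len(tag_names):
        raise ValueError("tag names must be unique")
    tags = {name: index for index, name in enumerate(tag_names)}
    tags[eos_tag] = len(tag_names)  # assumption: EOS takes the last index
    return tags


address_components = build_prediction_tags(["ATag", "AnotherTag"])
print(address_components)  # {'ATag': 0, 'AnotherTag': 1, 'EOS': 2}
```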

Download our Models

Here are the URLs to download our pre-trained models directly


Installation

Before installing deepparse, you must have the latest version of PyTorch in your environment.

  • Install the stable version of deepparse:
pip install deepparse
  • Install the latest development version of deepparse:
pip install -U git+https://github.com/GRAAL-Research/deepparse.git@dev
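Before running either command, the prerequisites stated above can be checked with a short script. This is a sketch that only uses the standard library; it assumes PyTorch is importable under the module name `torch`, and it does not check whether the installed PyTorch is the latest version.

```python
import importlib.util
import sys


def environment_report(min_python=(3, 7)):
    """Report whether the interpreter is recent enough and PyTorch is importable."""
    return {
        "python_ok": sys.version_info >= min_python,
        # find_spec returns None when no top-level module named "torch" is installed.
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }


report = environment_report()
print(report)
```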

Cite

Use the following for the article:

@misc{yassine2020leveraging,
    title={{Leveraging Subword Embeddings for Multinational Address Parsing}},
    author={Marouane Yassine and David Beauchemin and François Laviolette and Luc Lamontagne},
    year={2020},
    eprint={2006.16152},
    archivePrefix={arXiv}
}

and this one for the package:

@misc{deepparse,
    author = {Marouane Yassine and David Beauchemin},
    title  = {{Deepparse: A State-Of-The-Art Deep Learning Multinational Addresses Parser}},
    year   = {2020},
    note   = {\url{https://deepparse.org}}
}

Contributing to Deepparse

We welcome user input, whether it concerns bugs found in the library or feature proposals! Make sure to have a look at our contributing guidelines for more details on this matter.

License

Deepparse is LGPLv3 licensed, as found in the LICENSE file.


Comments
  • [FEATURE] cache handling and offline parsing handling

    Is your feature request related to a problem? Please describe. BPEmbEmbeddingsModel ("deepparse/embeddings_models/bpemb_embeddings_model.py") uses the default "cache_dir" of the BPEmb class. This is blocking when we want to override the "cache_dir" path, which defaults to '~/.cache/bpemb'.

    Describe the solution you'd like It would be better to add an "embeddings_path" parameter and pass it to the BPEmb class instantiation (as is done for "FastTextEmbeddingsModel").

    For example, the BPEmbEmbeddingsModel init function could look like this:

    # Assumes: import warnings; from pathlib import Path; from bpemb import BPEmb
    def __init__(self, embeddings_path: str, verbose: bool = True) -> None:
        super().__init__(verbose=verbose)
        with warnings.catch_warnings():
            # Silence the scipy.sparsetools private-module warnings and the boto warnings.
            warnings.filterwarnings("ignore")
            # Default parameters, except for the user-supplied cache directory.
            model = BPEmb(lang="multi", vs=100000, dim=300, cache_dir=Path(embeddings_path))
        self.model = model
    
    enhancement Waiting response 
    opened by fbougares 19
  • Export to ONNX

    Is your feature request related to a problem? Please describe. A script to convert the Address Parser (.ckpt) model to ONNX (.onnx)?

    Describe the solution you'd like Has someone successfully converted the address parser model to onnx format?

    enhancement stale 
    opened by ml5ah 15
  • Pickling error while retraining [BUG]

    Describe the bug

    ---------------------------------------------------------------------------
    OSError                                   Traceback (most recent call last)
    Input In [4], in <module>
          7 # The path to save our checkpoints
          8 logging_path = "checkpoints"
    ---> 10 address_parser.retrain(training_container, 0.8, epochs=5, batch_size=2, num_workers=1, callbacks=[lr_scheduler], prediction_tags=tag_dictionary, logging_path=logging_path)
    
    File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\deepparse\parser\address_parser.py:517, in AddressParser.retrain(self, dataset_container, train_ratio, batch_size, epochs, num_workers, learning_rate, callbacks, seed, logging_path, disable_tensorboard, prediction_tags, seq2seq_params)
        511         print(
        512             "You are using a older version of Poutyne that does not support properly error management."
        513             " Due to that, we cannot show retrain progress. To fix that, update Poutyne to "
        514             "the newest version."
        515         )
        516         with_capturing_context = True
    --> 517     train_res = self._retrain(
        518         experiment=exp,
        519         train_generator=train_generator,
        520         valid_generator=valid_generator,
        521         epochs=epochs,
        522         seed=seed,
        523         callbacks=callbacks,
        524         disable_tensorboard=disable_tensorboard,
        525         capturing_context=with_capturing_context,
        526     )
        527 except RuntimeError as error:
        528     list_of_file_path = os.listdir(path=".")
    
    File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\deepparse\parser\address_parser.py:849, in AddressParser._retrain(self, experiment, train_generator, valid_generator, epochs, seed, callbacks, disable_tensorboard, capturing_context)
        834 def _retrain(
        835     self,
        836     experiment: Experiment,
       (...)
        846     # If Poutyne 1.7 and before, we capture poutyne print since it print some exception.
        847     # Otherwise, we use a null context manager.
        848     with Capturing() if capturing_context else contextlib.nullcontext():
    --> 849         train_res = experiment.train(
        850             train_generator,
        851             valid_generator=valid_generator,
        852             epochs=epochs,
        853             seed=seed,
        854             callbacks=callbacks,
        855             verbose=self.verbose,
        856             disable_tensorboard=disable_tensorboard,
        857         )
        858     return train_res
    
    File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\poutyne\framework\experiment.py:519, in Experiment.train(self, train_generator, valid_generator, **kwargs)
        471 def train(self, train_generator, valid_generator=None, **kwargs) -> List[Dict]:
        472     """
        473     Trains or finetunes the model on a dataset using a generator. If a previous training already occurred
        474     and lasted a total of `n_previous` epochs, then the model's weights will be set to the last checkpoint and the
       (...)
        517         List of dict containing the history of each epoch.
        518     """
    --> 519     return self._train(self.model.fit_generator, train_generator, valid_generator, **kwargs)
    
    File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\poutyne\framework\experiment.py:668, in Experiment._train(self, training_func, callbacks, lr_schedulers, keep_only_last_best, save_every_epoch, disable_tensorboard, seed, *args, **kwargs)
        665     expt_callbacks += callbacks
        667 try:
    --> 668     return training_func(*args, initial_epoch=initial_epoch, callbacks=expt_callbacks, **kwargs)
        669 finally:
        670     if self.logging:
    
    File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\poutyne\framework\model.py:542, in Model.fit_generator(self, train_generator, valid_generator, epochs, steps_per_epoch, validation_steps, batches_per_step, initial_epoch, verbose, progress_options, callbacks)
        540     self._fit_generator_n_batches_per_step(epoch_iterator, callback_list, batches_per_step)
        541 else:
    --> 542     self._fit_generator_one_batch_per_step(epoch_iterator, callback_list)
        544 return epoch_iterator.epoch_logs
    
    File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\poutyne\framework\model.py:613, in Model._fit_generator_one_batch_per_step(self, epoch_iterator, callback_list)
        611 for train_step_iterator, valid_step_iterator in epoch_iterator:
        612     with self._set_training_mode(True):
    --> 613         for step, (x, y) in train_step_iterator:
        614             step.loss, step.metrics, _ = self._fit_batch(x, y, callback=callback_list, step=step.number)
        615             step.size = self.get_batch_size(x, y)
    
    File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\poutyne\framework\iterators.py:73, in StepIterator.__iter__(self)
         71 def __iter__(self):
         72     time_since_last_batch = timeit.default_timer()
    ---> 73     for step, data in _get_step_iterator(self.steps_per_epoch, self.generator):
         74         self.on_batch_begin(step, {})
         76         step_data = Step(step)
    
    File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\poutyne\framework\iterators.py:18, in cycle(iterable)
         16 def cycle(iterable):  # Equivalent to itertools cycle, without any extra memory requirement
         17     while True:
    ---> 18         for x in iterable:
         19             yield x
    
    File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\torch\utils\data\dataloader.py:359, in DataLoader.__iter__(self)
        357     return self._iterator
        358 else:
    --> 359     return self._get_iterator()
    
    File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\torch\utils\data\dataloader.py:305, in DataLoader._get_iterator(self)
        303 else:
        304     self.check_worker_number_rationality()
    --> 305     return _MultiProcessingDataLoaderIter(self)
    
    File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\torch\utils\data\dataloader.py:918, in _MultiProcessingDataLoaderIter.__init__(self, loader)
        911 w.daemon = True
        912 # NB: Process.start() actually take some time as it needs to
        913 #     start a process and pass the arguments over via a pipe.
        914 #     Therefore, we only add a worker to self._workers list after
        915 #     it started, so that we do not call .join() if program dies
        916 #     before it starts, and __del__ tries to join but will get:
        917 #     AssertionError: can only join a started process.
    --> 918 w.start()
        919 self._index_queues.append(index_queue)
        920 self._workers.append(w)
    
    File C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\multiprocessing\process.py:121, in BaseProcess.start(self)
        118 assert not _current_process._config.get('daemon'), \
        119        'daemonic processes are not allowed to have children'
        120 _cleanup()
    --> 121 self._popen = self._Popen(self)
        122 self._sentinel = self._popen.sentinel
        123 # Avoid a refcycle if the target function holds an indirect
        124 # reference to the process object (see bpo-30775)
    
    File C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\multiprocessing\context.py:224, in Process._Popen(process_obj)
        222 @staticmethod
        223 def _Popen(process_obj):
    --> 224     return _default_context.get_context().Process._Popen(process_obj)
    
    File C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\multiprocessing\context.py:327, in SpawnProcess._Popen(process_obj)
        324 @staticmethod
        325 def _Popen(process_obj):
        326     from .popen_spawn_win32 import Popen
    --> 327     return Popen(process_obj)
    
    File C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\multiprocessing\popen_spawn_win32.py:93, in Popen.__init__(self, process_obj)
         91 try:
         92     reduction.dump(prep_data, to_child)
    ---> 93     reduction.dump(process_obj, to_child)
         94 finally:
         95     set_spawning_popen(None)
    
    File C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\multiprocessing\reduction.py:60, in dump(obj, file, protocol)
         58 def dump(obj, file, protocol=None):
         59     '''Replacement for pickle.dump() using ForkingPickler.'''
    ---> 60     ForkingPickler(file, protocol).dump(obj)
    
    OSError: [Errno 22] Invalid argument
    

    To Reproduce I'm trying to train on custom tags on my own data like this -

    lr_scheduler = poutyne.StepLR(step_size=1, gamma=0.1)
    
    
    tag_dictionary = {'STREET_NUMBER': 0, 'STREET_NAME': 1, 'UNSTRUCTURED_STREET_ADDRESS': 2, 'CITY': 3, 'COUNTRY_SUB_ENTITY': 4, 'COUNTRY': 5, 'POSTAL_CODE': 6, 'EOS': 7}
    
    
    logging_path = "checkpoints"
    
    address_parser.retrain(training_container, 0.8, epochs=5, batch_size=2, num_workers=1, callbacks=[lr_scheduler], prediction_tags=tag_dictionary, logging_path=logging_path)
    

    Desktop (please complete the following information):

    • OS: Windows 10
    • Using CPU for training (as dataset is small)
    bug 
    opened by ChargedMonk 15
  • Tag Len DataError Occuring Regardless of Tag Len Matching Address Len

    I'm trying to retrain a BPEmb model with new address tags, and I am using CSVDatasetContainer to load the data. I've followed every guideline I could find so that the data is read in without errors. The training data has two columns in the required format. None of the addresses are empty or a single whitespace, and I've verified repeatedly that the number of words in each address matches the length of its tag list. I did this by tokenizing the original addresses and programmatically comparing their lengths with the lengths of the tag lists from the same row (using a pandas version of the same dataframe).

    I also dug into the source code and tried the function listed there (_data_tags_is_same_len_then_address); with the pandas version of my dataframe, the output is True, which is supposed to mean that everything is as it should be. I also tried PickleDatasetContainer instead, using a .p file with the data formatted as requested, and I get the same error.
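The length check described above can be sketched as follows. It is only an approximation of what the library does: it assumes words are separated by whitespace, and it parses tag lists that arrive as strings (for example `"['A', 'B']"`, as a CSV reader may hand them back) before counting them. Both helper names are hypothetical.

```python
import ast


def as_tag_list(tags):
    """Return tags as a list, parsing the string form a CSV reader may hand back."""
    if isinstance(tags, str):
        return list(ast.literal_eval(tags))
    return list(tags)


def find_length_mismatches(rows):
    """Return (row_index, n_words, n_tags) for each row whose counts differ."""
    mismatches = []
    for index, (address, tags) in enumerate(rows):
        n_words = len(address.split())
        n_tags = len(as_tag_list(tags))
        if n_words != n_tags:
            mismatches.append((index, n_words, n_tags))
    return mismatches


rows = [
    ("221 B Baker Street", ["StreetNumber", "Unit", "StreetName", "StreetName"]),
    ("221 B Baker Street", "['StreetNumber', 'Unit', 'StreetName', 'StreetName']"),
    ("221 B  Baker Street London", ["StreetNumber", "Unit", "StreetName", "StreetName"]),
]
print(find_length_mismatches(rows))  # [(2, 5, 4)]
```

If a tag column is left as a raw string, `len()` counts characters rather than tags, which is one thing worth ruling out when the counts look right in pandas but the container still complains.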

    This is how I'm trying to read in the data: CSVDatasetContainer(training_dataset_name + "." + file_extension, column_names=['Address', 'Tags'], separator=',')

    And this is the error I keep getting: image

    System Info:

    • OS: Windows 10
    • IDE: VS Code
    • Python Version: 3.9.12
    • Deepparse Version: 0.7.3
    • Poutyne Version: 1.9 (I used this specific version so I could use the progress bar feature, since there's another issue with the code that compares the float version of Poutyne to 1.8, because the latest version is 1.11 and that is technically a smaller decimal number)
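The version comparison problem mentioned in the last point can be reproduced without Poutyne. The sketch below compares versions as tuples of integers instead of floats; it assumes plain dotted numeric versions such as "1.11" and does not handle pre-release suffixes. The function names are illustrative and are not taken from deepparse.

```python
def parse_version(version):
    """Turn "1.11" or "1.11.0" into a tuple of integers such as (1, 11)."""
    return tuple(int(part) for part in version.split("."))


def is_at_least(installed, minimum):
    """Compare two dotted versions component by component."""
    left, right = parse_version(installed), parse_version(minimum)
    width = max(len(left), len(right))
    # Pad the shorter version with zeros so that "1.8" equals "1.8.0".
    left += (0,) * (width - len(left))
    right += (0,) * (width - len(right))
    return left >= right


print(float("1.11") >= float("1.8"))  # False: as a decimal number, 1.11 is smaller than 1.8
print(is_at_least("1.11", "1.8"))     # True: release 11 comes after release 8
```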

    I'm not 100% sure whether this qualifies as a bug, but it sure is perplexing and I'm not sure where else to ask for help.

    I guess this boils down to:

    • Is there anything about my system that could be causing this?
    • Is it the separator I'm using? (Without ',', the function won't read in the data correctly, and it has worked with a smaller training set before.)
    • Is there any other potential factor I haven't considered?

    Thanks in advance for your help.

    bug 
    opened by joseandrejv 11
  • [BUG] Received "TypeError: can't pickle fasttext_pybind.fasttext objects" when trying to retrain

    Describe the bug

    I was following the retrain instruction on the page, https://deepparse.org/examples/fine_tuning.html and I received the below error messages.

    address_parser.retrain(training_container, 0.8, epochs=5, batch_size=8)
    Traceback (most recent call last):
      File "", line 1, in
      File "C:\Users\janch.conda\envs\py36\lib\site-packages\deepparse\parser\address_parser.py", line 327, in retrain
        callbacks=callbacks)
      File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\experiment.py", line 477, in train
        return self._train(self.model.fit_generator, train_generator, valid_generator, **kwargs)
      File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\experiment.py", line 618, in _train
        return training_func(*args, initial_epoch=initial_epoch, callbacks=expt_callbacks, **kwargs)
      File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\model.py", line 575, in fit_generator
        self._fit_generator_one_batch_per_step(epoch_iterator, callback_list)
      File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\model.py", line 652, in _fit_generator_one_batch_per_step
        for step, (x, y) in train_step_iterator:
      File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\iterators.py", line 75, in iter
        for step, data in _get_step_iterator(self.steps_per_epoch, self.generator):
      File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\iterators.py", line 19, in cycle
        for x in iterable:
      File "C:\Users\janch.conda\envs\py36\lib\site-packages\torch\utils\data\dataloader.py", line 355, in iter
        return self._get_iterator()
      File "C:\Users\janch.conda\envs\py36\lib\site-packages\torch\utils\data\dataloader.py", line 301, in _get_iterator
        return _MultiProcessingDataLoaderIter(self)
      File "C:\Users\janch.conda\envs\py36\lib\site-packages\torch\utils\data\dataloader.py", line 914, in init
        w.start()
      File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\process.py", line 105, in start
        self._popen = self._Popen(self)
      File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\context.py", line 223, in _Popen
        return _default_context.get_context().Process._Popen(process_obj)
      File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\context.py", line 322, in _Popen
        return Popen(process_obj)
      File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\popen_spawn_win32.py", line 65, in init
        reduction.dump(process_obj, to_child)
      File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\reduction.py", line 60, in dump
        ForkingPickler(file, protocol).dump(obj)
    TypeError: can't pickle fasttext_pybind.fasttext objects

    • OS: Windows
    • Python 3.6
    • Running on CPU only
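The traceback ends in `popen_spawn_win32` and `ForkingPickler`: on Windows, worker processes are started with the spawn method, so everything handed to a worker has to be pickled first. The sketch below shows how to test whether an object survives pickling. The `HoldsNativeHandle` class is a stand-in that wraps a lock, not the fastText model itself.

```python
import pickle
import threading


class HoldsNativeHandle:
    """Stand-in for an object that wraps a resource pickle cannot serialize."""

    def __init__(self):
        self.lock = threading.Lock()


def is_picklable(obj):
    """Return True when pickle can serialize `obj`, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True


print(is_picklable({"StreetName": 1}))    # True
print(is_picklable(HoldsNativeHandle()))  # False
```

If the object that fails is needed by the data loading workers, one option that may help is to avoid worker processes altogether, for instance by passing `num_workers=0` to `retrain`, assuming that value is forwarded to the underlying data loader.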
    bug 
    opened by janchanyk 10
  • [RuntimeError] Retrain Error

    Hi, I got this error when I tried to retrain the model. What could be possible causes?

    RuntimeError: The size of tensor a (16) must match the size of tensor b (17) at non-singleton dimension 1

    I used this code setting

    address_parser = AddressParser(model_type="best", device=0)
    lr_scheduler = poutyne.StepLR(step_size=1, gamma=0.1)
    address_parser.retrain(training_container, 0.8, epochs=15, batch_size=64, num_workers=2, callbacks=[lr_scheduler])
    

    I have transformed my training data into a pickle file with the right format as the example in the doc; list of tuples ( 'address text', [list of tags corresponding to each word] ). Moreover, I have already made sure that the number of words in a tuple matches the number of elements in its corresponding list.

    opened by jomariya23156 10
  • [BUG] Error during downloading the weights for the network bpemb.

    Hello! It's impossible to download weights for this network. Could you upload this file somewhere else?

    To Reproduce

     address_parser = AddressParser(model_type="bpemb", device=0) 
    

    Full error message:

    /home/dev/.local/lib/python3.10/site-packages/deepparse/parser/address_parser.py:950: UserWarning: No CUDA device detected, device will be set to 'CPU'.
      warnings.warn("No CUDA device detected, device will be set to 'CPU'.")
    Loading the embeddings model
    /home/dev/.local/lib/python3.10/site-packages/deepparse/network/seq2seq.py:100: UserWarning: No pre-trained model where found in the cache directory /home/dev/.cache/deepparse. Thus, we willautomatically download the pre-trained model.
      warnings.warn(
    Downloading the weights for the network bpemb.
    Traceback (most recent call last):
      File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 169, in _new_conn
        conn = connection.create_connection(
      File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 96, in create_connection
        raise err
      File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 86, in create_connection
        sock.connect(sa)
    TimeoutError: timed out
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 699, in urlopen
        httplib_response = self._make_request(
      File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 382, in _make_request
        self._validate_conn(conn)
      File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 1012, in _validate_conn
        conn.connect()
      File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 353, in connect
        conn = self._new_conn()
      File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 174, in _new_conn
        raise ConnectTimeoutError(
    urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x7fdd1426a4d0>, 'Connection to graal.ift.ulaval.ca timed out. (connect timeout=5)')
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/usr/lib/python3/dist-packages/requests/adapters.py", line 439, in send
        resp = conn.urlopen(
      File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 755, in urlopen
        retries = retries.increment(
      File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 574, in increment
        raise MaxRetryError(_pool, url, error or ResponseError(cause))
    urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='graal.ift.ulaval.ca', port=443): Max retries exceeded with url: /public/deepparse/bpemb.ckpt (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fdd1426a4d0>, 'Connection to graal.ift.ulaval.ca timed out. (connect timeout=5)'))
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/dev/.local/lib/python3.10/site-packages/deepparse/parser/address_parser.py", line 237, in __init__
        self._model_factory(
      File "/home/dev/.local/lib/python3.10/site-packages/deepparse/parser/address_parser.py", line 1051, in _model_factory
        self.model = BPEmbSeq2SeqModel(
      File "/home/dev/.local/lib/python3.10/site-packages/deepparse/network/bpemb_seq2seq.py", line 70, in __init__
        self._load_pre_trained_weights(model_weights_name, cache_dir=cache_dir)
      File "/home/dev/.local/lib/python3.10/site-packages/deepparse/network/seq2seq.py", line 104, in _load_pre_trained_weights
        download_weights(model_type, cache_dir, verbose=self.verbose)
      File "/home/dev/.local/lib/python3.10/site-packages/deepparse/tools.py", line 109, in download_weights
        download_from_public_repository(model, saving_dir, "ckpt")
      File "/home/dev/.local/lib/python3.10/site-packages/deepparse/tools.py", line 92, in download_from_public_repository
        r = requests.get(url, timeout=5)
      File "/usr/lib/python3/dist-packages/requests/api.py", line 76, in get
        return request('get', url, params=params, **kwargs)
      File "/usr/lib/python3/dist-packages/requests/api.py", line 61, in request
        return session.request(method=method, url=url, **kwargs)
      File "/usr/lib/python3/dist-packages/requests/sessions.py", line 542, in request
        resp = self.send(prep, **send_kwargs)
      File "/usr/lib/python3/dist-packages/requests/sessions.py", line 655, in send
        r = adapter.send(request, **kwargs)
      File "/usr/lib/python3/dist-packages/requests/adapters.py", line 504, in send
        raise ConnectTimeout(e, request=request)
    requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='graal.ift.ulaval.ca', port=443): Max retries exceeded with url: /public/deepparse/bpemb.ckpt (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fdd1426a4d0>, 'Connection to graal.ift.ulaval.ca timed out. (connect timeout=5)'))
    

    Expected behavior Successfully downloaded the weight of this model

    Desktop:

    • OS: Ubuntu 22.04
    • Version: 0.9.1
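The traceback shows a single `requests.get(url, timeout=5)` call that fails on the first connection timeout. A generic retry wrapper with exponential backoff is one way to make such a download more tolerant of a slow server. The sketch below is independent of deepparse: `retry` and `flaky_download` are hypothetical names, and the fake download stands in for the real request. As far as we know, the exceptions raised by requests derive from `OSError`, which is why that is the default here.

```python
import time


def retry(operation, attempts=3, base_delay=1.0, retry_on=(OSError,), sleep=time.sleep):
    """Call `operation` until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error unchanged
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for a download that times out twice before succeeding.
outcomes = [TimeoutError("timed out"), TimeoutError("timed out"), b"weights"]


def flaky_download():
    outcome = outcomes.pop(0)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


waits = []  # record the delays instead of actually sleeping
result = retry(flaky_download, attempts=3, base_delay=1.0, sleep=waits.append)
print(result)  # b'weights'
print(waits)   # [1.0, 2.0]
```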
    enhancement 
    opened by IvanShift 8
  • [Question] Training noisy data from another country?

    If I have a large dataset of noisy raw addresses together with the correctly parsed result for each one, how do I start training deepparse on it to get a trained model?

    The raw+result data I have is currently in CSV format but with a bit of scripting I can easily transform into another format. I just don't completely understand how to train Deepparse for this.

    enhancement 
    opened by tk512 7
  • [BUG] `SSLError` when downloading model weights of model type: `bpemb`

    Describe the bug

    When trying to use the deepparse.parser.AddressParser class with model_type="bpemb", the model weights download fails due to an SSLError:

    requests.exceptions.SSLError: HTTPSConnectionPool(host='bpemb.h-its.org', port=443): Max retries exceeded with url: /multi/multi.wiki.bpe.vs100000.model (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)')))
    

    To Reproduce

    Delete model weights cache, most likely ~/.cache/deepparse, and attempt to initialise the class:

    from deepparse.parser import AddressParser
    address_parser = AddressParser(model_type="bpemb", attention_mechanism=False)
    

    Expected behavior

    The model download should not fail.

    Desktop:

    No LSB modules are available.
    Distributor ID: Ubuntu
    Description:    Ubuntu 20.04.3 LTS
    Release:        20.04
    Codename:       focal
    

    I am using deepparse==0.9.1.

    Additional context

    For the moment, I have implemented a dirty fix using a no_ssl_verification (from https://gist.github.com/ChenTanyi/0c47652bd916b61dc196968bca7dad1d) where I initialise the class under this context.

    bug 
    opened by AjinkyaIndulkar 6
  • Use memory mapping when loading embeddings

    One idea for a future release would be to load the embeddings via memory mapping instead of loading them all into memory.

    For fasttext, it seems that the Fasttext API does not support memory mapping. However, gensim seems to support it, but not with the fasttext format. So, either we save the current embeddings in a format that the gensim API can memory-map and upload them somewhere (GRAAL website server???), or we take embeddings provided by gensim and retrain a model with them.

    For BPEmb, I haven't checked but it's less bad with regard to memory usage.
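A minimal sketch of the memory-mapping idea, using only Python's standard library rather than the fastText or gensim APIs. It assumes the embeddings have been exported to a flat float32 file with a fixed dimension; that file layout and the names `write_embeddings`, `read_vector`, and `DIM` are hypothetical and not part of deepparse.

```python
import mmap
import os
import tempfile
from array import array

DIM = 4  # hypothetical embedding dimension, kept tiny for illustration
ITEM_SIZE = array("f").itemsize  # bytes per single-precision float


def write_embeddings(path, vectors):
    """Store vectors back to back as raw float32 values (row-major)."""
    with open(path, "wb") as handle:
        for vector in vectors:
            array("f", vector).tofile(handle)


def read_vector(path, index, dim=DIM):
    """Return one vector by index; only the requested bytes are copied out of the map."""
    row_bytes = dim * ITEM_SIZE
    start = index * row_bytes
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            row = array("f")
            row.frombytes(mapped[start:start + row_bytes])
    return row.tolist()


# Usage: write two tiny vectors, then read the second one back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "embeddings.bin")
    write_embeddings(demo_path, [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]])
    second = read_vector(demo_path, 1)
```

With this layout the operating system pages in only the parts of the file that are actually read, which is the property the proposal is after. A real implementation would also need a word-to-index lookup, which is left out here.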

    opened by freud14 6
  • Retrain an Address Parser for Single Country Uses

    Describe the bug

    While going through the "Retrain an Address Parser for Single Country Uses" process, I tried to retrain the model for Mexico-only use. Everything went well until I tested the address_parser object with the test_container data.

    To Reproduce

    Import the train and test datasets into memory to retrain our parser model

    import os

    from deepparse.dataset_container import PickleDatasetContainer
    from deepparse.parser import AddressParser

    # root_dir is not defined in the original report.
    clean_root_dir = os.path.join(root_dir, "clean_data")
    clean_train_directory = os.path.join(clean_root_dir, "train")
    clean_test_directory = os.path.join(clean_root_dir, "test")

    mx_training_data_path = os.path.join(clean_train_directory, "mx.p")
    mx_test_data_path = os.path.join(clean_test_directory, "mx.p")

    training_container = PickleDatasetContainer(mx_training_data_path)
    test_container = PickleDatasetContainer(mx_test_data_path)

    address_parser = AddressParser(model_type="fasttext", device=0)

    address_parser.test(test_container, batch_size=256)

    Expected behavior

    I expected to obtain the test results for the test_container Mexican dataset.

    Screenshots

    [Screenshot of the problem: Screen Shot 2022-11-08 at 20 14 09]

    Desktop (please complete the following information):

    • OS: macOS Big Sur
    • Version: 11.6
    bug Waiting response 
    opened by tapiatellez 5
  • PO Boxes

    Dear friends,

    the parser works well for generic street addresses, but when I tried to parse a US PO Box address, it failed:

    parsed_address = address_parser("PO Box 40070 Nashville TN 37204")

    [('40070', 'StreetNumber'), ('po box', 'StreetName'), (None, 'Unit'), ('nashville', 'Municipality'), ('tn', 'Province'), ('37204', 'PostalCode'), (None, 'Orientation'), (None, 'GeneralDelivery'), (None, 'EOS')]

    Any plans to improve the training dataset? As far as I remember libpostal works well with PO Boxes and could generate PO Box addresses...
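Until the training data covers PO Boxes, a caller could at least flag this failure mode after parsing. The sketch below is a hypothetical post-check, not part of deepparse: it takes the list of (value, tag) pairs exactly as printed in the report and tests whether the StreetName value is nothing but a PO Box marker. The names `PO_BOX_PATTERN` and `po_box_mislabeled` are made up for illustration.

```python
import re

# Parser output copied from the report above.
parsed_address = [
    ("40070", "StreetNumber"),
    ("po box", "StreetName"),
    (None, "Unit"),
    ("nashville", "Municipality"),
    ("tn", "Province"),
    ("37204", "PostalCode"),
    (None, "Orientation"),
    (None, "GeneralDelivery"),
    (None, "EOS"),
]

# Matches "po box", "p.o. box", "P O Box", and similar spellings.
PO_BOX_PATTERN = re.compile(r"p\.?\s*o\.?\s*box", re.IGNORECASE)


def po_box_mislabeled(components):
    """Return True when the StreetName value is just a PO Box marker."""
    by_tag = {tag: value for value, tag in components}
    street_name = by_tag.get("StreetName") or ""
    return PO_BOX_PATTERN.fullmatch(street_name.strip()) is not None


flagged = po_box_mislabeled(parsed_address)
```

Addresses flagged this way could be routed to a separate rule or reviewed by hand; the check does not correct the tags itself.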

    enhancement stale in progress 
    opened by crtnx 16
Releases(0.9.3)
  • 0.9.3(Nov 24, 2022)

  • 0.9.2(Sep 23, 2022)

    • Improve Deepparse server error handling and error output
    • Remove deprecated argument saving_dir in download_fasttext_magnitude_embeddings and download_fasttext_embeddings functions
    • Add offline argument to remove verification of the latest version
    • Bug-fix cache handling in download model
    • Add download_models CLI function
    • https://github.com/GRAAL-Research/deepparse/issues/156
    Source code(tar.gz)
    Source code(zip)
  • 0.9.1(Aug 19, 2022)

  • 0.9(Aug 19, 2022)

    • Add save_model_weights method to AddressParser to save model weights (PyTorch state dictionary)
    • Improve CI
    • Added verbose flag for the test to activate or deactivate the test verbosity (it overrides the AddressParser verbosity)
    • Add Docker image
    • Add val_dataset to retrain API to allow the use of a specific val dataset for training
    • Remove deprecated download_from_url function
    • Remove deprecated dataset_container argument
    • Fixed error and docs
    • Added the UK retrain example
    Source code(tar.gz)
    Source code(zip)
  • 0.8.3(Aug 19, 2022)

  • 0.8.2(Jul 27, 2022)

    • Bug-fix retrain attention model naming parsing
    • Improve error handling when an object that is not a DatasetContainer is passed to the retrain and test API
    • Add DOI
    Source code(tar.gz)
    Source code(zip)
  • 0.8.1(Jul 26, 2022)

    • Refactored function download_from_url to download_from_public_repository.
    • Add error management when retraining a FastText-like model on Windows with a number of workers (num_workers) greater than 0.
    • Improve dev tooling
    • Improve CI
    • Improve code coverage and pylint
    • Add Codacy
    Source code(tar.gz)
    Source code(zip)
  • 0.8(Jul 6, 2022)

    • Improve SEO.
    • Add cache_dir arg in all CLI functions.
    • Improve handling of HTTP error in models version verification.
    • Improve doc.
    • Add a note for parsing data cleaning (i.e. lowercase, commas removal, and hyphen replacing).
    • Add hyphen parsing cleaning step (with a bool flag to activate or not) to improve some country address parsing (see issue 137).
    • Add ListDatasetContainer for Python list dataset.
    Source code(tar.gz)
    Source code(zip)
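The cleaning steps named in the 0.8 notes (lowercasing, comma removal, optional hyphen replacement) can be pictured with a small stand-alone sketch. This is not deepparse's implementation: the function name `clean_address`, the flag name `replace_hyphen`, the choice to replace hyphens with a space, and the final whitespace collapse are all assumptions made for illustration.

```python
def clean_address(address, replace_hyphen=False):
    """Lowercase the address, drop commas, and optionally turn hyphens into spaces."""
    cleaned = address.lower().replace(",", "")
    if replace_hyphen:
        # Assumption: a hyphen is replaced by a single space.
        cleaned = cleaned.replace("-", " ")
    # Collapse any runs of whitespace left behind by the removals.
    return " ".join(cleaned.split())


default_result = clean_address("3-350 Rue des Lilas, Quebec")
hyphen_result = clean_address("3-350 Rue des Lilas, Quebec", replace_hyphen=True)
```

Keeping the hyphen step behind a flag matches the release note: unit-number formats such as "3-350" are country-specific, so the replacement helps in some countries and not in others.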
  • 0.7.6(Jun 9, 2022)

  • 0.7.5(Jun 9, 2022)

    • Bug-fix Poutyne version handling that caused a print error when retraining with Poutyne version 1.11
    • Add the option to name a retrained parsing model, using either a default name based on the architecture settings or a user-given name
    • Hot-fix missing raise for DataError validation of the address to parse when the address is a tuple
    • Bug-fix handling of string column name for CSVDatasetContainer that raised ValueError
    • Improve parse CLI doc and fix error in doc stating JSON format is supported as input data
    • Add batch_size to parse CLI
    • Set the minimum Gensim version to 4.0.0.
    • Add a new CLI function, retrain, to retrain from the command line
    • Improve doc
    • Add cache_dir to the BPEmb embedding model and to AddressParser to change the embeddings cache directory and models weights cache directory
    • Change the saving_dir argument of the download_fasttext_embeddings and download_fasttext_magnitude_embeddings functions to cache_dir. saving_dir is now deprecated and will be removed in version 0.8.
    • Add a new CLI function, test, to test from the command line
    Source code(tar.gz)
    Source code(zip)
  • 0.7.4(May 12, 2022)

    • Improve parsed address print
    • Bug-fix #124: comma-separated list without whitespace in CSVDatasetContainer
    • Add a report when the lengths of the addresses to parse and the tags list differ
    • Add an example on how to fine-tune using our CSVDatasetContainer
    • Improve data validation for data to parse
    Source code(tar.gz)
    Source code(zip)
  • 0.7.3(Apr 8, 2022)

  • 0.7.2(Mar 20, 2022)

  • 0.7.1(Mar 16, 2022)

  • 0.7(Feb 11, 2022)

  • 0.6.7(Feb 10, 2022)

    • Fixed errors in data validation
    • Improved doc over data validation
    • Bugfix data slicing error with data containers
    • Add an example on how to use a retrained model
    Source code(tar.gz)
    Source code(zip)
  • 0.6.6(Feb 9, 2022)

  • 0.6.5(Feb 9, 2022)

    • Improve error handling of empty data and whitespace-only data.
    • Parsing now includes two validations on data quality (not empty and not whitespace-only)
    • DataContainer now includes data quality tests (not empty, not whitespace-only, tags not empty, tags the same length as the address, and data is a list of tuples)
    • New CSVDatasetContainer
    • DataContainer can now be used to predict using a flag.
    • Add a CLI to parse addresses from the command line.
    Source code(tar.gz)
    Source code(zip)
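The data quality rules listed in the 0.6.5 notes can be sketched as a single stand-alone check. This is an illustration only, not the DataContainer code: the function name `find_problem` is hypothetical, and "tags the same length as the address" is interpreted here as one tag per whitespace-separated token, which is an assumption.

```python
def find_problem(data):
    """Return a description of the first rule violation, or None when the data passes."""
    if not isinstance(data, list) or not all(
        isinstance(item, tuple) and len(item) == 2 for item in data
    ):
        return "data is not a list of (address, tags) tuples"
    for address, tags in data:
        if not isinstance(address, str) or address == "":
            return "empty address"
        if address.strip() == "":
            return "whitespace-only address"
        if not tags:
            return "empty tags"
        # Assumption: one tag is expected per whitespace-separated token.
        if len(tags) != len(address.split()):
            return "tags and address lengths differ"
    return None


valid_example = [
    ("350 rue des lilas", ["StreetNumber", "StreetName", "StreetName", "StreetName"]),
]
problem = find_problem(valid_example)
```

Returning a message instead of raising keeps the sketch easy to try interactively; a container class would more likely raise an error at construction time.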
  • 0.6.4(Jan 21, 2022)

  • 0.6.3(Dec 21, 2021)

    Fixed the printing capture that raised an error with Poutyne as of version 1.8. The previous approach is kept for compatibility with earlier Poutyne versions. Added a flag to enable or disable TensorBoard during retraining.

    Source code(tar.gz)
    Source code(zip)
  • 0.6.2(Dec 13, 2021)

    • Improved (slightly) code speed of data padding method as per PyTorch list or array to Tensor recommendation.
    • Improved doc for RuntimeError due to retraining FastText and BPEmb model in the same directory.
    • Added error handling RuntimeError when retraining.
    Source code(tar.gz)
    Source code(zip)
  • 0.6.1(Dec 8, 2021)

  • 0.6(Dec 7, 2021)

  • 0.5.1(Nov 1, 2021)

    • Fixed address_comparer hint typing error
    • Fixed some docs errors
    • Retrain and test now have more default parameters
    • Various small code and tests improvements
    Source code(tar.gz)
    Source code(zip)
  • 0.5(Oct 21, 2021)

    • Added Python 3.9
    • Added feature to allow a more flexible way to retrain
    • Added a feature to allow retrain of a new seq2seq architecture
    • Fixed prediction tags bug when parsing with new tags after retraining
    Source code(tar.gz)
    Source code(zip)
  • 0.4.4(Oct 4, 2021)

  • 0.4.3(Oct 1, 2021)

  • 0.4.2(Jul 23, 2021)

  • 0.4.1(Jun 15, 2021)

    • Added method to specify the format of address components of a FormattedParsedAddress. Formatting can specify the field separator, the field to be capitalized, and the field to be upper case.
    Source code(tar.gz)
    Source code(zip)
  • 0.4(Jun 9, 2021)

    • Added a verbose flag to training and test, based on the value set in the __init__ of the address parser.
    • Added a feature to retrain our models with prediction tags dictionary different from the default one.
    • Added in-doc code examples.
    • Added code examples.
    • Small improvement of our model implementation.
    Source code(tar.gz)
    Source code(zip)
Owner
GRAAL/GRAIL
Machine Learning Research Group - Université Laval
Refactoring dalle-pytorch and taming-transformers for TPU VM

Text-to-Image Translation (DALL-E) for TPU in Pytorch Refactoring Taming Transformers and DALLE-pytorch for TPU VM with Pytorch Lightning Requirements

Kim, Taehoon 61 Nov 07, 2022
Simple ray intersection library similar to coldet - succeeded by libacc

Ray Intersection This project offers a header only acceleration structure library including implementations for a BVH- and KD-Tree. Applications may i

Nils Moehrle 29 Jun 23, 2022
Face Recognition & AI Based Smart Attendance Monitoring System.

In today’s generation, authentication is one of the biggest problems in our society. So, one of the most known techniques used for authentication is h

Sagar Saha 1 Jan 14, 2022
deep learning model that learns to code with drawing in the Processing language

sketchnet sketchnet - processing code generator can we teach a computer to draw pictures with code. We use Processing and java/jruby code paired with

41 Dec 12, 2022
python 93% acc. CNN Dogs Vs Cats ( Pytorch )

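English | Simplified Chinese (in testing... stay tuned) Cnn-Classification-Dog-Vs-Cat: cat vs. dog recognition (PyTorch version). A CNN ResNet18 cat/dog classifier based on the ResNet family and its variants, which performs well on general image recognition tasks; model accuracy reaches 93% (small sample). Project created in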

apple ye 1 May 22, 2022
Physics-Informed Neural Networks (PINN) and Deep BSDE Solvers of Differential Equations for Scientific Machine Learning (SciML) accelerated simulation

NeuralPDE NeuralPDE.jl is a solver package which consists of neural network solvers for partial differential equations using scientific machine learni

SciML Open Source Scientific Machine Learning 680 Jan 02, 2023
Deep RGB-D Saliency Detection with Depth-Sensitive Attention and Automatic Multi-Modal Fusion (CVPR'2021, Oral)

DSA^2 F: Deep RGB-D Saliency Detection with Depth-Sensitive Attention and Automatic Multi-Modal Fusion (CVPR'2021, Oral) This repo is the official imp

如今我已剑指天涯 46 Dec 21, 2022
iris - Open Source Photos Platform Powered by PyTorch

Open Source Photos Platform Powered by PyTorch. Submission for PyTorch Annual Hackathon 2021.

Omkar Prabhu 137 Sep 10, 2022
Framework for evaluating ANNS algorithms on billion scale datasets.

Billion-Scale ANN http://big-ann-benchmarks.com/ Install The only prerequisite is Python (tested with 3.6) and Docker. Works with newer versions of Py

Harsha Vardhan Simhadri 132 Dec 24, 2022
Tensorflow implementation of Semi-supervised Sequence Learning (https://arxiv.org/abs/1511.01432)

Transfer Learning for Text Classification with Tensorflow Tensorflow implementation of Semi-supervised Sequence Learning(https://arxiv.org/abs/1511.01

DONGJUN LEE 82 Oct 22, 2022
Proto-RL: Reinforcement Learning with Prototypical Representations

Proto-RL: Reinforcement Learning with Prototypical Representations This is a PyTorch implementation of Proto-RL from Reinforcement Learning with Proto

Denis Yarats 74 Dec 06, 2022
(AAAI2022) Style Mixing and Patchwise Prototypical Matching for One-Shot Unsupervised Domain Adaptive Semantic Segmentation

SM-PPM This is a Pytorch implementation of our paper "Style Mixing and Patchwise Prototypical Matching for One-Shot Unsupervised Domain Adaptive Seman

W-zx-Y 10 Dec 07, 2022
A general, feasible, and extensible framework for classification tasks.

Pytorch Classification A general, feasible and extensible framework for 2D image classification. Features Easy to configure (model, hyperparameters) T

Eugene 26 Nov 22, 2022
Some toy examples of score matching algorithms written in PyTorch

toy_gradlogp This repo implements some toy examples of the following score matching algorithms in PyTorch: ssm-vr: sliced score matching with variance

Ending Hsiao 21 Dec 26, 2022
Soomvaar is the repo which 🏩 contains different collection of 👨‍💻🚀code in Python and 💫✨Machine 👬🏼 learning algorithms📗📕 that is made during 📃 my practice and learning of ML and Python✨💥

Soomvaar 📌 Introduction Soomvaar is the collection of various codes implement in machine learning and machine learning algorithms with python on coll

Felix-Ayush 42 Dec 30, 2022
A spatial genome aligner for analyzing multiplexed DNA-FISH imaging data.

jie jie is a spatial genome aligner. This package parses true chromatin imaging signal from noise by aligning signals to a reference DNA polymer model

Bojing Jia 9 Sep 29, 2022
Code release for Convolutional Two-Stream Network Fusion for Video Action Recognition

Convolutional Two-Stream Network Fusion for Video Action Recognition

Christoph Feichtenhofer 676 Dec 31, 2022
NAS-Bench-x11 and the Power of Learning Curves

NAS-Bench-x11 NAS-Bench-x11 and the Power of Learning Curves Shen Yan, Colin White, Yash Savani, Frank Hutter. NeurIPS 2021. Surrogate NAS benchmarks

AutoML-Freiburg-Hannover 13 Nov 18, 2022
[ICCV2021] Learning to Track Objects from Unlabeled Videos

Unsupervised Single Object Tracking (USOT) 🌿 Learning to Track Objects from Unlabeled Videos Jilai Zheng, Chao Ma, Houwen Peng and Xiaokang Yang 2021

53 Dec 28, 2022
In this project we predict the forest cover type using the cartographic variables in the training/test datasets.

Kaggle Competition: Forest Cover Type Prediction In this project we predict the forest cover type (the predominant kind of tree cover) using the carto

Marianne Joy Leano 1 Mar 15, 2022