SHIFT15M: multiobjective large-scale fashion dataset with distributional shifts

Last update: Nov 24, 2022

Overview

The main motivation of the SHIFT15M project is to provide a dataset that contains natural dataset shifts, collected from the web service IQON, which was in operation for a decade. In addition, the SHIFT15M dataset contains several types of dataset shift, which makes it possible to evaluate the robustness of a model to each type (e.g., covariate shift and target shift).

We provide the Datasheet for SHIFT15M. This datasheet is based on the Datasheets for Datasets [1] template.

System	Python 3.6	Python 3.7	Python 3.8
Linux CPU
Linux GPU
Windows CPU / GPU	Status Currently Unavailable	Status Currently Unavailable	Status Currently Unavailable
Mac OS CPU

SHIFT15M is a large-scale dataset based on approximately 15 million items accumulated by the fashion search service IQON.

Installation

From PyPI

$ pip install shift15m

From source

$ git clone https://github.com/st-tech/zozo-shift15m.git
$ cd zozo-shift15m
$ poetry build
$ pip install dist/shift15m-xxxx-py3-none-any.whl

Download SHIFT15M dataset

Use Dataset class

You can download the SHIFT15M dataset as follows:

from shift15m.datasets import NumLikesRegression

dataset = NumLikesRegression(root="./data", download=True)

Download directly by using download scripts

Please download the dataset as follows:

$ bash scripts/download_all.sh

To avoid downloading the test dataset for set matching (80GB), which is not required in training, you can use the following script.

$ bash scripts/download_all_wo_set_testdata.sh

Tasks

The following tasks are now available:

Task	Task type	Shift type	Input shape	Output shape
NumLikesRegression	regression	target shift	(N, 25)	(N, 1)
SumPricesRegression	regression	covariate shift, target shift	(N, 1)	(N, 1)
ItemPriceRegression	regression	target shift	(N, 4096)	(N, 1)
ItemCategoryClassification	classification	target shift	(N, 4096)	(N, 7)
Set2SetMatching	set-to-set matching	covariate shift	(N, 4096)x(M, 4096)	(1)

Benchmarks

As templates for numerical experiments on the SHIFT15M dataset, we have published experimental results for each task with several models.

Original Dataset Structure

The original dataset is maintained in JSON format, and each row has the following structure:

{
  "user":{"user_id":"xxxx", "fav_brand_ids":"xxxx,xx,..."},
  "like_num":"xx",
  "set_id":"xxx",
  "items":[
    {"price":"xxxx","item_id":"xxxxxx","category_id1":"xx","category_id2":"xxxxx"},
    ...
  ],
  "publish_date":"yyyy-mm-dd"
}
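
A row in this format can be read with the Python standard library alone. The following is a minimal sketch, assuming that each row is available as one JSON string and that `like_num` and `price` are integers stored as strings, as the placeholders above suggest. The helper name `parse_row` and the sample values are hypothetical and are not part of the shift15m package.

```python
import json
from datetime import datetime


def parse_row(line):
    """Parse one raw JSON row into typed Python values (hypothetical helper)."""
    row = json.loads(line)
    return {
        "user_id": row["user"]["user_id"],
        # Assumption: fav_brand_ids is a comma-separated string of IDs.
        "fav_brand_ids": [b for b in row["user"]["fav_brand_ids"].split(",") if b],
        # Assumption: like_num and price are integers stored as strings.
        "like_num": int(row["like_num"]),
        "set_id": row["set_id"],
        "items": [
            {
                "item_id": item["item_id"],
                "price": int(item["price"]),
                "category_id1": item["category_id1"],
                "category_id2": item["category_id2"],
            }
            for item in row["items"]
        ],
        "publish_date": datetime.strptime(row["publish_date"], "%Y-%m-%d").date(),
    }


# Made-up values, used only to exercise the helper.
sample_line = json.dumps(
    {
        "user": {"user_id": "1", "fav_brand_ids": "10,20"},
        "like_num": "5",
        "set_id": "100",
        "items": [
            {"price": "3000", "item_id": "7", "category_id1": "11", "category_id2": "222"}
        ],
        "publish_date": "2015-04-01",
    }
)
parsed = parse_row(sample_line)
print(parsed["like_num"], parsed["publish_date"].year, parsed["items"][0]["price"])
```

Converting `publish_date` to a date object makes it straightforward to group rows by year, which is one natural way to study how the data changes over time.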

Contributing

To learn more about making a contribution to SHIFT15M, please see the following materials:

License

The dataset itself is provided under a CC BY-NC 4.0 license, whereas the software in this repository is provided under the MIT license.

Dataset metadata

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

property	value
name	SHIFT15M Dataset
alternateName	SHIFT15M
alternateName	shift15m-dataset
url	https://github.com/st-tech/zozo-shift15m
sameAs	https://github.com/st-tech/zozo-shift15m
description	SHIFT15M is a multi-objective, multi-domain dataset which includes multiple dataset shifts.

provider

property	value
name	`ZOZO Research`
sameAs	`https://ja.wikipedia.org/wiki/ZOZO`

license

property	value
name	`CC BY-NC 4.0`
url	`https://github.com/st-tech/zozo-shift15m/blob/main/LICENSE.CC`

Citation

@misc{Kimura_SHIFT15M_Multiobjective_LargeScale_2021,
author = {Kimura, Masanari and Nakamura, Takuma and Saito, Yuki},
month = {8},
title = {SHIFT15M: Multiobjective Large-Scale Fashion Dataset with Distributional Shifts},
year = {2021}
}

Errata

No errata are currently available.

References

[1] Gebru, Timnit, et al. "Datasheets for datasets." arXiv preprint arXiv:1803.09010 (2018).

Comments

If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate).

documentation datasheet

opened by nocotan 3
Extracting Image Features
@nocotan I'm planning to prepare image features as we discussed. To be extracted:

CNN features (2048 dimensional features from the pre-trained Inception-V3 model on ILSVRC2012)

By the way, I tried to find a suitable hand-crafted image feature extractor that takes color into account, but could not find any available code. For instance, in the texture classification task described in the following paper, combining Local Binary Pattern (LBP) and Local Color Contrast (LCC) outperformed other color-based hand-crafted features, but no open-source implementation of LCC is available. https://www.researchgate.net/publication/315858786_Hand-Crafted_vs_Learned_Descriptors_for_Color_Texture_Classification

So, here I'm planning not to include a hand-crafted one for the image-based task.
opened by wildsnowman 2
Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remainder of the questions in this section.

documentation datasheet

opened by nocotan 2
add LICENSE
Before publishing, we need to determine the license of the repository, e.g.,

MIT

Apache

BSD

GPL

After researching which license is appropriate, please add the LICENSE to the repository.
documentation
opened by nocotan 2
Got a TypeError exception when trying to run the item category prediction task
First of all, thank you for your great work and for releasing the dataset.

Description: When I tried to run the item_category_prediction task following the item_category_prediction usage instructions, I got an exception like this:

Environment:

Python 3.8.8

Any advice would be very helpful. Thank you.
bug
opened by you0xy 1
Information: the dataset size

the number of outfits: 2,555,147
the number of images (multiple-counting): 15,218,721
the number of unique images: 2,335,598

Note: maybe shift28M is not the correct name.

opened by wildsnowman 1
How will the dataset be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?

documentation datasheet

opened by nocotan 1
Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description.

documentation datasheet

opened by nocotan 1
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks)? If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms?

documentation datasheet

opened by nocotan 1
Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data.

documentation datasheet

opened by nocotan 1
Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation.

documentation datasheet

opened by nocotan 1
Bump setuptools from 65.4.1 to 65.5.1
Bumps setuptools from 65.4.1 to 65.5.1.


dependencies
opened by dependabot[bot] 0
Bug: the number of val/test data is not consistent with other cases when the same years are selected for train_year and test_year.

Describe the bug: In set matching, the numbers of samples are restricted to 30816, 3851, and 3851 for the train, val, and test data, respectively; however, when the same year is selected for train_year and test_year, the counts become inconsistent with this.

This bug may invalidate experiments that vary train_year and test_year.
bug

opened by wildsnowman 0
disjoint set matching

Parent Task

set matching

Model List

Note

It may be necessary to conduct set matching experiments under a disjoint setting, in which testing uses only items that were not included in training.
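
One possible way to build such a disjoint test split is sketched below. This is an illustration under stated assumptions, not the repository's implementation: each set is represented as a plain list of item IDs, the function name `filter_disjoint` is hypothetical, and a test set is dropped entirely as soon as it shares one item with the training data. An alternative would be to remove only the shared items from each test set.

```python
def filter_disjoint(train_sets, test_sets):
    """Keep only test sets that share no item with any training set.

    Each set is a list of item IDs (hypothetical representation).
    """
    # Collect every item ID that appears anywhere in the training data.
    train_items = {item_id for item_ids in train_sets for item_id in item_ids}
    # A test set survives only if none of its items was seen in training.
    return [item_ids for item_ids in test_sets if train_items.isdisjoint(item_ids)]


# Made-up item IDs, used only to exercise the function.
train_sets = [["a", "b", "c"], ["c", "d"]]
test_sets = [["a", "x"], ["x", "y", "z"], ["e", "f"]]
disjoint_test = filter_disjoint(train_sets, test_sets)
print(disjoint_test)
```

Dropping whole sets keeps the remaining test outfits intact, at the cost of a smaller test split.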

References

https://arxiv.org/abs/1804.09979
benchmark

opened by wildsnowman 0
Implementation of the set data loader with tags

Is your feature request related to a problem? Please describe. We added tag information to the dataset, so it would be useful to implement an additional data loader that exposes this information.

Describe the solution you'd like: This can be accomplished by adding arguments to an existing data loader.


opened by nocotan 0

Releases (v0.2.0)

v0.2.0 (Sep 20, 2022)

Added tag information to each row, as follows:

{
  "user":{"user_id":"xxxx", "fav_brand_ids":"xxxx,xx,..."},
  "like_num":"xx",
  "set_id":"xxx",
  "items":[
    {"price":"xxxx","item_id":"xxxxxx","category_id1":"xx","category_id2":"xxxxx"},
    ...
  ],
  "publish_date":"yyyy-mm-dd",
  "tags": "tag_a, tag_b, tag_c, ..."
}
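
The `tags` field is a single comma-separated string, so it usually needs to be split before use. The sketch below is a minimal illustration, assuming a row has already been loaded into a Python dict and that rows created before v0.2.0 may lack the `tags` key. The helper name `get_tags` is hypothetical and is not part of the shift15m package.

```python
def get_tags(row):
    """Return the tags of a row as a list of strings (hypothetical helper)."""
    # Assumption: rows that predate v0.2.0 may lack "tags"; treat them as untagged.
    raw = row.get("tags", "")
    # Split on commas and drop surrounding whitespace and empty entries.
    return [tag.strip() for tag in raw.split(",") if tag.strip()]


# Made-up rows, used only to exercise the helper.
tagged = get_tags({"set_id": "100", "tags": "tag_a, tag_b, tag_c"})
untagged = get_tags({"set_id": "101"})
print(tagged, untagged)
```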