Create a Dataset From a Zip File Synchronously

Creates a dataset, labels, and examples from the specified .zip file. The call returns after the dataset is created and all of the images are uploaded. Use this API call for .zip files that are smaller than 10 MB.

Request Parameters

Name

Type

Description

Available Version

data

string

Path to the .zip file on the local drive (FilePart). The maximum .zip file size you can upload from a local drive is 50 MB.

1.0

language

string

Dataset language. Optional. Default is N/A. Reserved for future use.

2.0

name

string

Name of the dataset. Optional. If this parameter is omitted, the dataset name is derived from the .zip file name.

1.0

path

string

URL of the .zip file. The maximum .zip file size you can upload from a web location is 50 MB.

1.0

type

string

Type of dataset data. Valid values are:

  • image: Classification dataset.
  • image-detection—Object detection dataset. Available in Einstein Vision API version 2.0 and later.
  • image-multi-label—Multi-label classification dataset. Available in Einstein Vision API version 2.0 and later.

1.0

The API call is synchronous, so results are returned after the data has been uploaded to the dataset. If the call succeeds, the response includes the labels array, available is true, and statusMsg is SUCCEEDED.

You must provide the path to the .zip file either on the local machine (the data parameter) or in the cloud (the path parameter).
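As a rough illustration of the request parameters, the sketch below assembles the form fields for this call. The helper name build_upload_fields is hypothetical and not part of the API. Rejecting a request that supplies both data and path is an assumption, since the text only says that one of them is required. Sending the request (endpoint URL, authorization) is left out because this section doesn't cover it.

```python
from typing import Dict, Optional

VALID_TYPES = {"image", "image-detection", "image-multi-label"}
MAX_DATASET_NAME = 100  # characters, per the limits in this section


def build_upload_fields(
    dataset_type: str,
    data: Optional[str] = None,
    path: Optional[str] = None,
    name: Optional[str] = None,
) -> Dict[str, str]:
    """Return the form fields for a synchronous dataset upload.

    data is a local .zip path; path is a URL that points to a .zip file.
    """
    # Assumption: exactly one source is expected. The documentation only
    # states that a local path or a cloud location must be provided.
    if (data is None) == (path is None):
        raise ValueError("provide exactly one of data (local file) or path (URL)")
    if dataset_type not in VALID_TYPES:
        raise ValueError(f"unknown dataset type: {dataset_type}")
    if name is not None and len(name) > MAX_DATASET_NAME:
        raise ValueError("dataset name is limited to 100 characters")

    fields: Dict[str, str] = {"type": dataset_type}
    if data is not None:
        fields["data"] = data
    if path is not None:
        fields["path"] = path
    if name is not None:
        fields["name"] = name
    return fields
```

The returned dictionary can be handed to whatever HTTP client you use to build the multipart request.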

If the dataset type is image or image-multi-label, this API:

  • Creates a dataset that has the same name as the .zip file (limit is 100 characters), if the name parameter is omitted.
  • Creates a label for each directory in the .zip file. The label name is the same name as the directory name (limit is 180 characters).
  • Creates an example for each image file in each directory in the .zip file. The example name is the same as the image file name.

If the dataset type is image-detection, this API:

  • Creates a dataset that has the same name as the .zip file (limit is 100 characters), if the name parameter is omitted.
  • Creates a label for each unique label in the annotations.csv file (limit is 180 characters).
  • Creates an example for each image file in the .zip file.

Keep the following points in mind when creating datasets.

All Datasets

  • If your .zip file is more than 10 MB, we recommend that you use the asynchronous call to create a dataset. If you use this call with a large dataset .zip file, the call could time out. See Create a Dataset From a Zip File Asynchronously.

  • The maximum .zip file size you can upload from a local drive or a web location is 50 MB.

  • The maximum total dataset size is 2 GB. After you create the dataset, you can use the PUT call to add more examples to it.

  • If the name parameter is passed, the maximum length is 100 characters.

  • The maximum image file name length is 150 characters, including the file extension. If the .zip file contains a file with a name longer than 150 characters (including the file extension), the example is created in the dataset, but the API truncates the example name to 150 characters.

  • If the .zip file contains an image file that has a name containing spaces, the spaces are removed from the file name before the file is uploaded. For example, a file called sandy beach.jpg gets the example name sandybeach.jpg. If the .zip file contains an image file that has a name with non-ASCII characters, those characters are converted to UTF-8.

  • When specifying the URL for a .zip file in a cloud drive service like Dropbox, be sure it's a link to the file and not a link to the interactive download page. For example, the URL should look like https://www.dropbox.com/s/abcdxyz/mountainvsbeach.zip?dl=1

  • If the .zip file has an incorrect structure, the API returns an error: FAILED: Invalid zip format provided for <dataset_name>.

  • If you create a dataset or upload images from a .zip file in Apex code, be sure that you reference the URL to the file with https and not http.
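The size limits and naming rules in this list can be checked locally before uploading. The following is a minimal sketch under stated assumptions: 1 MB is taken to be 1024 × 1024 bytes, spaces are assumed to be removed before the name is truncated, and the function names are hypothetical helpers, not API calls.

```python
from urllib.parse import urlparse

MB = 1024 * 1024                 # assumption: binary megabytes
SYNC_RECOMMENDED_MAX = 10 * MB   # above this, the asynchronous call is recommended
UPLOAD_MAX = 50 * MB             # maximum .zip size from a local drive or a URL
MAX_EXAMPLE_NAME = 150           # characters, including the file extension


def choose_call(zip_size_bytes: int) -> str:
    """Return 'sync', 'async', or 'too-large' for a .zip file of this size."""
    if zip_size_bytes > UPLOAD_MAX:
        return "too-large"
    if zip_size_bytes > SYNC_RECOMMENDED_MAX:
        return "async"
    return "sync"


def expected_example_name(file_name: str) -> str:
    """Approximate the example name stored for an image file name."""
    # Assumption: spaces are stripped first, then the result is truncated.
    return file_name.replace(" ", "")[:MAX_EXAMPLE_NAME]


def is_https(url: str) -> bool:
    """Check that a .zip URL uses https, as required when calling from Apex."""
    return urlparse(url).scheme == "https"
```

For example, choose_call applied to the size reported by os.path.getsize tells you whether this call or the asynchronous one is the better fit.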

Image or Image Multi-Label Datasets

  • The .zip file must have a specific directory structure:

    • In the root, there should be a parent directory that contains subdirectories.

    • Each subdirectory below the parent directory becomes a label in the dataset. This subdirectory must contain images to be added to the dataset.

    • Each subdirectory below the parent directory should contain only images and not any nested subdirectories.

  • If you have a large amount of data (gigabytes), you might want to break up your data into multiple .zip files. You can load the first .zip file using this call and then load subsequent .zip files using PUT. See Create Examples From Zip File.

  • If you create a dataset from a .zip file, you can only add examples to it from a .zip file using PUT. See Create Examples From Zip File. You can't add a single example from a file.

  • The maximum directory name length is 180 characters. If the .zip file contains a directory with a name longer than 180 characters, the label is created in the dataset, but the API truncates the label name to 180 characters.

  • The minimum number of labels is two. You can create an image classification dataset with only one label, but training the dataset will fail and return an error.

  • The minimum number of examples per label is 10.

  • The minimum number of total examples across all labels is 40.

  • Image files must be smaller than 1 MB. If the .zip file contains image files larger than 1 MB, those images aren't loaded and no error is returned.

  • Images must be no larger than 2,000 pixels high by 2,000 pixels wide. You can upload images that are larger, but training the dataset might fail.

  • The supported image file types are PNG, JPG, and JPEG. If the .zip file contains any unsupported image file types, those images won't be uploaded and no error is returned.

  • Duplicate images are handled differently based on the dataset type.

    • Image: For datasets of type image, if there are duplicate image files in the .zip file, only the first file is uploaded. Duplicate images are checked within directories and across directories. If there's more than one image file with the same file contents in the same directory or in multiple directories, only the first file is uploaded and the others are skipped.

    • Multi-label: For datasets of type image-multi-label, if there are duplicate image files in a single directory, only the first file is uploaded and the others are skipped. In a multi-label dataset, it's expected that there are duplicate files across directories. If there's more than one image file with the same file contents in multiple directories, the file is loaded multiple times with a different label.

  • You can download an example image .zip file from https://einstein.ai/images/mountainvsbeach.zip.
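Because oversized or unsupported images are skipped without an error, it can help to count what the API is likely to accept before uploading. The sketch below walks a .zip file that follows the parent/label/image layout described above and reports labels that fall short of the stated minimums. It's an approximation, not the API's own validation: the helper names are hypothetical, 1 MB is assumed to be 1024 × 1024 bytes, and file extensions are matched case-insensitively as an assumption.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath
from typing import Dict, List

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}
MAX_IMAGE_BYTES = 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes


def count_examples_per_label(zf: zipfile.ZipFile) -> Dict[str, int]:
    """Count supported images per label directory (parent/label/image)."""
    counts: Counter = Counter()
    for info in zf.infolist():
        if info.is_dir():
            continue
        parts = PurePosixPath(info.filename).parts
        # Only files exactly one level below a label directory are counted.
        if len(parts) != 3:
            continue
        if PurePosixPath(parts[2]).suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # Images must be smaller than 1 MB (uncompressed size).
        if info.file_size >= MAX_IMAGE_BYTES:
            continue
        counts[parts[1]] += 1
    return dict(counts)


def training_problems(counts: Dict[str, int]) -> List[str]:
    """List the ways the counts fall short of the documented minimums."""
    problems = []
    if len(counts) < 2:
        problems.append("fewer than 2 labels")
    for label, n in sorted(counts.items()):
        if n < 10:
            problems.append(f"label {label!r} has {n} examples (minimum is 10)")
    if sum(counts.values()) < 40:
        problems.append("fewer than 40 examples in total")
    return problems
```

An empty list from training_problems means the counts meet the label and example minimums listed here. It says nothing about pixel dimensions or duplicates.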

Object Detection Datasets

  • Here are the guidelines for the .zip file:
    • The .zip file must contain two types of elements: (1) the image files specified in the annotations.csv file and (2) a file named annotations.csv that contains the bounding box data.
    • Images can be in the root of the .zip file or in a folder or folders in the root of the .zip file. If images are in folders more than one level deep, you'll receive an error when you try to create the dataset.
    • The supported image file types are PNG, JPG, and JPEG. If the .zip file contains any unsupported image file types, those images won't be uploaded and no error is returned.
    • The annotations.csv file is a text file that contains the data for the bounding boxes associated with each image. The file must have this exact name.
    • The annotations.csv file can be anywhere within the .zip file.
  • The maximum label name length is 180 characters. If the annotations file contains a label with a name longer than 180 characters, the label is created in the dataset, but the API truncates the label name to 180 characters.

  • Image files must be smaller than 5 MB. If the .zip file contains image files larger than 5 MB, those images aren't loaded and no error is returned.

  • Labels are case sensitive. If you have labels Oatmeal and oatmeal, they are two distinct labels in the dataset and the resulting model.

  • The minimum number of labels is one.

  • When you create a dataset, all the images are checked for duplicates. If the .zip file contains multiple image files that have the same contents, only the first of the duplicate files is uploaded.

  • If there's an image in the .zip file, but no bounding box descriptions for that image in the annotations file, the image is dropped and no error is returned.

  • You can download an example object detection .zip file from https://einstein.ai/images/alpine.zip.
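The layout rules for an object detection .zip file can be checked locally in a similar way. This sketch looks only for a missing annotations.csv file and for images nested more than one folder deep. The function name is a hypothetical helper, and matching file extensions case-insensitively is an assumption.

```python
import zipfile
from pathlib import PurePosixPath
from typing import List

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def detection_zip_problems(zf: zipfile.ZipFile) -> List[str]:
    """Return layout problems found in an object detection .zip file."""
    problems = []
    names = [i.filename for i in zf.infolist() if not i.is_dir()]

    # annotations.csv can be anywhere in the archive but must have this exact name.
    if not any(PurePosixPath(n).name == "annotations.csv" for n in names):
        problems.append("annotations.csv is missing")

    for n in names:
        p = PurePosixPath(n)
        if p.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # parts includes the file name: image.jpg has 1 part,
        # folder/image.jpg has 2, and anything longer is too deep.
        if len(p.parts) > 2:
            problems.append(f"{n} is nested more than one folder deep")
    return problems
```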

Annotations.csv File Format

The annotations.csv file contains the bounding box coordinates and the labels for each image.

  1. The first row in the file contains the headers for the CSV values. We use the convention of image_file and boxn, but each header value can be any string.

    • image_file—Header for the image file name.
    • boxn—Header for each bounding box element. The number of boxn values in the header is the maximum number of bounding boxes you can have in an image.
  2. Each row after the header specifies the bounding box descriptions in JSON format for each image in the .zip file. There should be one row per file. Multiple bounding boxes for the same image are listed as separate columns in the same row. The image name provided must be the exact name of the image file included in the parent folder. The x, y, width, and height values specify the bounding box location within the image. The following table lists the required fields for each bounding box.

Name    Type    Description
label   string  Classification label for the content in the bounding box.
height  int     Height of the bounding box in pixels.
width   int     Width of the bounding box in pixels.
x       int     Location of the bounding box on the horizontal axis.
y       int     Location of the bounding box on the vertical axis.
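Writing these rows by hand is error-prone because each bounding box is a JSON object embedded in a quoted CSV cell. One way to produce the file, sketched here with Python's standard csv and json modules, is shown below. The function name is hypothetical. Rows with fewer boxes than the header simply get fewer columns here; whether the API expects empty padding cells isn't stated in this section.

```python
import csv
import io
import json
from typing import Dict, Sequence

Box = Dict[str, object]  # keys: label, x, y, width, height


def write_annotations(rows: Dict[str, Sequence[Box]]) -> str:
    """Render {image file name: [bounding boxes]} as annotations.csv text."""
    max_boxes = max((len(boxes) for boxes in rows.values()), default=0)
    out = io.StringIO()
    # QUOTE_ALL wraps every cell in quotes and doubles the quotes inside
    # the JSON, which matches the format of the annotations.csv file.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["image_file"] + [f"box{i}" for i in range(max_boxes)])
    for image_file, boxes in rows.items():
        writer.writerow([image_file] + [json.dumps(box) for box in boxes])
    return out.getvalue()
```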

Here's an example of an annotations.csv file for two images.

"image_file","box0","box1"
"picture1.jpg","{""label"": ""cat"", ""y"": 242, ""x"": 160, ""height"": 62, ""width"": 428}","{""label"": ""turtle"", ""y"": 113, ""x"": 61, ""height"": 74, ""width"": 718}"
"picture2.jpg","{""label"": ""dog"", ""y"": 94, ""x"": 27, ""height"": 144, ""width"": 184}","{""label"": ""dog"", ""y"": 50, ""x"": 286, ""height"": 344, ""width"": 348}"

Here's the second image referenced in the annotations.csv file showing the bounding boxes.

[Image: picture2.jpg with its two bounding boxes, both labeled dog.]
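To sanity-check an annotations.csv file before zipping it, you can parse it back and confirm that every bounding box carries the required fields. The sketch below uses Python's standard csv and json modules. The function name is hypothetical, and skipinitialspace is enabled so that a stray space after a comma doesn't break the quoted JSON cell.

```python
import csv
import io
import json
from typing import Dict, List

REQUIRED_FIELDS = {"label", "height", "width", "x", "y"}


def parse_annotations(text: str) -> Dict[str, List[dict]]:
    """Map each image file name to its list of bounding box dicts."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    next(reader, None)  # skip the header row; header values can be any string
    result: Dict[str, List[dict]] = {}
    for row in reader:
        if not row:
            continue
        image_file, cells = row[0], row[1:]
        boxes = []
        for cell in cells:
            if not cell.strip():
                continue  # image has fewer boxes than the header allows
            box = json.loads(cell)
            missing = REQUIRED_FIELDS - set(box)
            if missing:
                raise ValueError(f"{image_file}: missing {sorted(missing)}")
            boxes.append(box)
        result[image_file] = boxes
    return result
```

Applied to the two-image example, this yields two boxes for picture1.jpg (cat and turtle) and two for picture2.jpg (both dog).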

Response Body

Name

Type

Description

Available Version

available

boolean

Specifies whether the dataset is ready to be trained.

1.0

createdAt

date

Date and time that the dataset was created.

1.0

id

long

Dataset ID.

1.0

labelSummary

object

Contains the labels array that contains all the labels for the dataset.

1.0

language

string

Dataset language. Default is N/A.

2.0

name

string

Name of the dataset. If the name parameter was omitted from the request, the API uses the name of the .zip file for the dataset name.

1.0

numOfDuplicates

int

Number of duplicate images in the .zip file from which the dataset was created.

2.0

object

string

Object returned; in this case, dataset.

1.0

statusMsg

string

Status of the dataset creation and data upload. Valid values are:

  • FAILED: <message>—Data upload has failed.
  • SUCCEEDED—Data upload is complete.
  • UPLOADING—Data upload is in progress.

1.0

totalExamples

int

Total number of examples in the dataset.

1.0

type

string

Type of dataset data. Valid values are:

  • image—Classification dataset.
  • image-detection—Object detection dataset. Available in Einstein Vision API version 2.0 and later.
  • image-multi-label—Multi-label classification dataset. Available in Einstein Vision API version 2.0 and later.

1.0

updatedAt

date

Date and time that the dataset was last updated.

1.0

Labels Response Body

Name         Type    Description                                   Available Version
datasetId    long    ID of the dataset that the label belongs to.  1.0
id           long    ID of the label.                              1.0
name         string  Name of the label.                            1.0
numExamples  int     Number of examples that have the label.       1.0
