Create Examples From a Zip File

Adds examples from a .zip file to a dataset. You can use this call only with a dataset that was created from a .zip file.

Request Parameters

| Name | Type | Description | Available Version |
|------|------|-------------|-------------------|
| data | string | Path to the .zip file on the local drive. The maximum file size you can upload from a local drive is 50 MB. | 1.0 |
| path | string | URL of the .zip file. The maximum file size you can upload from a web location is 2 GB. | 1.0 |
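As a rough illustration of how a client might pick between the two parameters, here is a minimal Python sketch. The function name choose_upload_parameter and the local_size argument are hypothetical, and the sketch assumes that 1 MB means 1024 * 1024 bytes, which this page does not specify.

```python
import os
from urllib.parse import urlparse

# Assumption: 1 MB = 1024 * 1024 bytes; the documentation does not define the unit.
MAX_LOCAL_BYTES = 50 * 1024 * 1024


def choose_upload_parameter(source, local_size=None):
    """Return (parameter_name, value) for a .zip source.

    source is either a URL or a local file path. local_size lets a caller
    supply the file size in bytes instead of reading it from disk.
    """
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        # Remote files go in `path`; their size cannot be checked locally.
        return ("path", source)
    size = os.path.getsize(source) if local_size is None else local_size
    if size > MAX_LOCAL_BYTES:
        raise ValueError("local .zip files are limited to 50 MB; host the file and use path")
    return ("data", source)
```

The returned pair can then be used as the single form field of the upload request.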

You must provide the location of the .zip file, either on the local machine (data) or in the cloud (path). This call adds examples from the .zip file to the specified dataset. The call is asynchronous, so the response that's initially returned describes the dataset as it was before the upload, and available is false.

To find out when the upload is complete, call Get a Dataset with the dataset ID. The data upload is complete when available is true and statusMsg is SUCCEEDED.

{
  "id": 1000022,
  "name": "mountainvsbeach",
  "createdAt": "2017-02-17T00:22:10.000+0000",
  "updatedAt": "2017-02-17T00:29:56.000+0000",
  "labelSummary": {
    "labels": [
      {
        "id": 1819,
        "datasetId": 1000022,
        "name": "Mountains",
        "numExamples": 150
      },
      {
        "id": 1820,
        "datasetId": 1000022,
        "name": "Beaches",
        "numExamples": 147
      }
    ]
  },
  "totalExamples": 297,
  "totalLabels": 2,
  "available": true,
  "statusMsg": "SUCCEEDED",
  "type": "image",
  "language": "N/A",
  "numOfDuplicates": 0,
  "object": "dataset"
}
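The polling step described above could be sketched as follows. This is an illustration under stated assumptions, not the API's client library: fetch_dataset is a hypothetical callable that performs the Get a Dataset call and returns the parsed JSON response as a dict, and the interval and timeout defaults are arbitrary.

```python
import time


def wait_for_upload(fetch_dataset, dataset_id, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll until the upload finishes and return the final dataset dict.

    fetch_dataset is a hypothetical callable that wraps the Get a Dataset
    call and returns the response body as a dict.
    """
    waited = 0.0
    while True:
        dataset = fetch_dataset(dataset_id)
        status = dataset.get("statusMsg") or ""
        if status.startswith("FAILED"):
            # statusMsg carries the failure message, e.g. "FAILED: Invalid zip format ..."
            raise RuntimeError(status)
        if dataset.get("available") and status == "SUCCEEDED":
            return dataset
        if waited >= timeout:
            raise TimeoutError(f"dataset {dataset_id} not available after {timeout} seconds")
        sleep(interval)
        waited += interval
```

Passing sleep as a parameter keeps the loop easy to exercise without real delays.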

Keep the following points in mind when creating examples from a .zip file:

All Datasets

  • If you try to create examples in a dataset while a previous call to create examples is still processing (the dataset's available value is false), the call fails and you receive an error. You must wait until the dataset's available value is true before starting another upload.

  • If the .zip file has an incorrect structure, the API returns an error: FAILED: Invalid zip format provided for <dataset_name>.

  • The maximum total dataset size is 2 GB.

  • The maximum image file name length is 150 characters, including the file extension. If the .zip file contains a file with a name longer than 150 characters (including the file extension), the example is created in the dataset, but the API truncates the example name to 150 characters.

  • Supported image file types are PNG, JPG, and JPEG. If the .zip file contains any unsupported image file types, those images won't be uploaded and no error is returned.

  • If the .zip file contains an image file that has a name containing spaces, the spaces are removed from the file name before the file is uploaded. For example, if you have a file called sandy beach.jpg the example name becomes sandybeach.jpg. If the .zip file contains an image file that has a name with non-ASCII characters, those characters are converted to UTF-8.

  • When specifying the URL for a .zip file in a cloud drive service like Dropbox, be sure it's a link to the file and not a link to the interactive download page. For example, the URL should look like https://www.dropbox.com/s/abcdxyz/mountainvsbeach.zip?dl=1

  • If you create a dataset or upload images from a .zip file in Apex code, be sure that you reference the URL to the file with https and not http.
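Because unsupported files are skipped silently and names can be altered, it may help to inspect a .zip file before uploading it. The following Python sketch lists entries that the rules above would affect. The function name preview_zip_entries is hypothetical, and the sketch assumes that file extensions are matched without regard to case and that annotations.csv (used by object detection datasets) should be ignored by the check.

```python
import posixpath
import zipfile

SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg"}
MAX_FILE_NAME_LENGTH = 150  # includes the file extension


def preview_zip_entries(zip_source):
    """Return (entry_name, note) pairs for entries the API would alter or skip."""
    notes = []
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            file_name = posixpath.basename(info.filename)
            if file_name == "annotations.csv":
                continue  # bounding box data, not an image
            # Assumption: extensions are compared case-insensitively.
            extension = posixpath.splitext(file_name)[1].lower()
            if extension not in SUPPORTED_EXTENSIONS:
                notes.append((info.filename, "unsupported type, skipped without an error"))
                continue
            if " " in file_name:
                notes.append((info.filename, "spaces removed: " + file_name.replace(" ", "")))
            if len(file_name) > MAX_FILE_NAME_LENGTH:
                notes.append((info.filename, "name truncated to 150 characters"))
    return notes
```

An empty result means none of the naming or file type rules apply to the archive.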

Image or Image Multi-Label Datasets

  • The .zip file must have a specific directory structure:

    • In the root, there should be a parent directory that contains subdirectories.

    • Each subdirectory below the parent directory becomes a label in the dataset unless the directory name matches a label that's already in the dataset. This subdirectory must contain images to be added to the dataset.

    • Each subdirectory below the parent directory should contain only images and not any nested subdirectories.

  • If the .zip file contains a directory label that's already in the dataset, the API adds the images from that directory to the specified label in the dataset.

  • If the .zip file contains a directory name that isn't a label in the dataset, the API adds a new label (limit is 180 characters).

  • The maximum directory name length is 180 characters. If the .zip file contains a directory with a name greater than 180 characters, the label is created in the dataset, but the API truncates the label name to 180 characters.

  • Image files must be smaller than 1 MB. If the .zip file contains image files larger than 1 MB, those images aren't loaded and no error is returned.

  • Images must be no larger than 2,000 pixels high by 2,000 pixels wide. You can upload images that are larger, but training the dataset might fail.

  • This API call checks for duplicate images within the .zip file that contains the new images. However, the call doesn't check for duplicates between the .zip file and the images already in the dataset. Duplicate images are handled differently based on the dataset type:

    • Image: For datasets of type image, duplicates are checked within directories and across directories. If more than one image file has the same file contents, in the same directory or in multiple directories, only the first file is uploaded and the others are skipped.

    • Multi-label: For datasets of type image-multi-label, if there are duplicate image files in a single directory, only the first file is uploaded and the others are skipped. In a multi-label dataset, duplicate files across directories are expected. If more than one image file has the same file contents in multiple directories, the file is loaded multiple times, each time with a different label.
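The duplicate rules for the two dataset types can be sketched in Python as shown here. This is only an approximation of the behavior described above: the function name find_skipped_duplicates is hypothetical, "same file contents" is modeled as an identical SHA-256 digest, and "first file" is taken to mean the first entry in archive order. The API's actual comparison method and ordering are not documented on this page.

```python
import hashlib
import posixpath
import zipfile


def find_skipped_duplicates(zip_source, dataset_type="image"):
    """Return entry names that would be skipped as duplicates within the .zip file."""
    seen = set()
    skipped = []
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            digest = hashlib.sha256(archive.read(info)).hexdigest()
            directory = posixpath.dirname(info.filename)
            # image: contents must be unique across the whole archive.
            # image-multi-label: contents must be unique only within one directory.
            key = digest if dataset_type == "image" else (directory, digest)
            if key in seen:
                skipped.append(info.filename)
            else:
                seen.add(key)
    return skipped
```

Running the check for both dataset types on the same archive shows how many more files a multi-label dataset keeps.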

Object Detection Datasets

  • Here are the guidelines for the .zip file:
    • The .zip file must contain two types of elements: (1) the image files specified in the annotations.csv file and (2) a file named annotations.csv that contains the bounding box data.
    • Images can be in the root of the .zip file or in a folder or folders in the root of the .zip file. If images are in folders more than one level deep, you'll receive an error when you try to create the dataset.
    • The supported image file types are PNG, JPG, and JPEG. If the .zip file contains any unsupported image file types, those images won't be uploaded and no error is returned.
    • The annotations.csv file is a text file that contains the data for the bounding boxes associated with each image. The file must have this exact name.
    • The annotations.csv file can be anywhere within the .zip file.
    • The maximum label name length is 180 characters. If the annotations file contains a label with a name greater than 180 characters, the label is created in the dataset, but the API truncates the label name to 180 characters.

  • Image files must be smaller than 5 MB. If the .zip file contains image files larger than 5 MB, those images aren't loaded and no error is returned.

  • Images must be no larger than 1,600 pixels high by 1,600 pixels wide. You can upload images that are larger, but training the dataset might fail.

  • Labels are case sensitive. If you have labels Oatmeal and oatmeal, they are two distinct labels in the dataset and the resulting model.

  • When you create a dataset, all the images are checked for duplicates. If the .zip file contains multiple image files that have the same contents, only the first of the duplicate files is uploaded. This call also checks for duplicate images between the .zip file and the images already in the dataset. If an image that has the same contents exists in both the .zip file and the dataset, the image in the dataset is replaced with the more recent image from the .zip file.

  • If there's an image in the .zip file, but no bounding box descriptions for that image in the annotations file, the image is dropped and no error is returned.
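Several of the object detection rules above fail silently, so a local check before uploading can be useful. The Python sketch below is one possible approach. The function name check_detection_zip and the report keys are hypothetical, and the sketch assumes that the first column of annotations.csv holds bare image file names (no folder prefix) and that the file is UTF-8 encoded.

```python
import csv
import io
import posixpath
import zipfile

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def check_detection_zip(zip_source):
    """Return a dict describing problems found in an object detection .zip file."""
    with zipfile.ZipFile(zip_source) as archive:
        names = [n for n in archive.namelist() if not n.endswith("/")]
        csv_names = [n for n in names if posixpath.basename(n) == "annotations.csv"]
        if not csv_names:
            raise ValueError("the .zip file must contain a file named annotations.csv")
        text = archive.read(csv_names[0]).decode("utf-8")
    rows = list(csv.reader(io.StringIO(text)))
    annotated = {row[0] for row in rows[1:] if row}  # the first row is the header
    images = [n for n in names if posixpath.splitext(n)[1].lower() in IMAGE_EXTENSIONS]
    image_names = {posixpath.basename(n) for n in images}
    return {
        # Images may sit in the root or in a folder in the root, but no deeper.
        "too_deep": [n for n in images if n.count("/") > 1],
        # Images without bounding boxes are dropped and no error is returned.
        "dropped_without_boxes": sorted(image_names - annotated),
        # Rows in annotations.csv that name an image missing from the archive.
        "missing_images": sorted(annotated - image_names),
    }
```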

Annotations.csv File Format

The annotations.csv file contains the bounding box coordinates and the labels for each image.

  1. The first row in the file contains the headers for the CSV values. We use the convention of image_file and boxn, but each header value can be any string.

    • image_file: Header for the image file name.
    • boxn: Header for each bounding box element. The number of boxn values in the header is the maximum number of bounding boxes you can have in an image.
  2. Each row after the header specifies the bounding box descriptions in JSON format for each image in the .zip file. There should be one row per file. Multiple bounding boxes for the same image are listed as separate columns in the same row. The image name provided must be the exact name of the image file included in the parent folder. The x, y, width, and height values specify the bounding box location within the image. The required fields for each bounding box are:

    • label: Classification label for the content in the bounding box.
    • height: Height of the bounding box in pixels.
    • width: Width of the bounding box in pixels.
    • x: Location of the bounding box on the horizontal axis.
    • y: Location of the bounding box on the vertical axis.

Here's an example of an annotations.csv file for two images.

"image_file","box0","box1"
"picture1.jpg","{""label"": ""cat"", ""y"": 242, ""x"": 160, ""height"": 62, ""width"": 428}","{""label"": ""turtle"", ""y"": 113, ""x"": 61, ""height"": 74, ""width"": 718}"
"picture2.jpg","{""label"": ""dog"", ""y"": 94, ""x"": 27, ""height"": 144, ""width"": 184}","{""label"": ""dog"", ""y"": 50, ""x"": 286, ""height"": 344, ""width"": 348}"

Here's the second image referenced in the annotations.csv file showing the bounding boxes.
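Writing the doubled quotation marks by hand is error-prone, so the file is easier to generate than to type. The following Python sketch produces text in the format described above using the standard csv and json modules. The function name build_annotations_csv is hypothetical, and padding rows that have fewer boxes with empty cells is an assumption; this page does not say how the API treats empty box columns.

```python
import csv
import io
import json

REQUIRED_FIELDS = {"label", "x", "y", "width", "height"}


def build_annotations_csv(boxes_by_image):
    """Return annotations.csv text for a dict of image file name -> list of box dicts."""
    max_boxes = max((len(boxes) for boxes in boxes_by_image.values()), default=0)
    output = io.StringIO()
    # QUOTE_ALL quotes every cell and doubles the quotation marks inside the JSON.
    writer = csv.writer(output, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["image_file"] + [f"box{i}" for i in range(max_boxes)])
    for image_file, boxes in boxes_by_image.items():
        for box in boxes:
            missing = REQUIRED_FIELDS - set(box)
            if missing:
                raise ValueError(f"{image_file}: missing fields {sorted(missing)}")
        cells = [json.dumps(box) for box in boxes]
        # Assumption: rows with fewer boxes are padded with empty cells.
        cells += [""] * (max_boxes - len(cells))
        writer.writerow([image_file] + cells)
    return output.getvalue()
```

Each box dict is serialized as one JSON cell, so the number of box columns in the header equals the largest number of boxes on any single image.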

Response Body

| Name | Type | Description | Available Version |
|------|------|-------------|-------------------|
| available | boolean | Specifies whether the dataset is ready to be trained. | 1.0 |
| createdAt | date | Date and time that the dataset was created. | 1.0 |
| id | long | ID of the dataset. | 1.0 |
| labelSummary | object | Contains the labels array that contains all the labels for the dataset. | 1.0 |
| language | string | Dataset language. Default is N/A. | 2.0 |
| name | string | Name of the dataset. | 1.0 |
| numOfDuplicates | int | Number of duplicate images. This number includes duplicates in the .zip file from which the dataset was created plus the number of duplicate images from subsequent PUT calls to add images to the dataset. | 2.0 |
| object | string | Object returned; in this case, dataset. | 1.0 |
| statusMsg | string | Status of the dataset creation and data upload. Valid values are FAILED: <message> (data upload has failed), SUCCEEDED (data upload is complete), and UPLOADING (data upload is in progress). | 1.0 |
| totalExamples | int | Total number of examples in the dataset. | 1.0 |
| totalLabels | int | Total number of labels in the dataset. | 1.0 |
| type | string | Type of dataset data. Valid values are image and image-multi-label (available in Einstein Vision API version 2.0 and later). | 1.0 |
| updatedAt | date | Date and time that the dataset was last updated. | 1.0 |

Label Response Body

| Name | Type | Description | Available Version |
|------|------|-------------|-------------------|
| datasetId | long | ID of the dataset that the label belongs to. | 1.0 |
| id | long | ID of the label. | 1.0 |
| name | string | Name of the label. | 1.0 |
| numExamples | int | Number of examples that have the label. | 1.0 |
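For client code that consumes the response, the fields above map naturally onto small typed records. The Python sketch below is one way to do that. The class names, the snake_case attribute names, and the parse_dataset function are hypothetical, only a subset of the fields is modeled, and the sketch assumes that every modeled field is present in the response.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Label:
    id: int
    dataset_id: int
    name: str
    num_examples: int


@dataclass
class Dataset:
    id: int
    name: str
    available: bool
    status_msg: str
    total_examples: int
    total_labels: int
    labels: List[Label]


def parse_dataset(payload):
    """Parse a dataset response body (a JSON string) into a Dataset."""
    data = json.loads(payload)
    labels = [
        Label(item["id"], item["datasetId"], item["name"], item["numExamples"])
        for item in data.get("labelSummary", {}).get("labels", [])
    ]
    return Dataset(
        id=data["id"],
        name=data["name"],
        available=data["available"],
        status_msg=data["statusMsg"],
        total_examples=data["totalExamples"],
        total_labels=data["totalLabels"],
        labels=labels,
    )
```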
