Create a Dataset From a File Asynchronously

Creates a dataset, labels, and examples from the specified .csv, .tsv, or .json file. The call returns immediately and continues to upload data in the background.

Request Parameters

| Name | Type | Description | Available Version |
|---|---|---|---|
| data | string | Path to the .csv, .tsv, or .json file on the local drive (FilePart). The maximum file size you can upload from a local drive is 25 MB. | 2.0 |
| language | string | Dataset language. Optional. Default is en_US. See Supported Languages. | 2.0 |
| name | string | Name of the dataset. Optional. If this parameter is omitted, the dataset name is derived from the file name. | 2.0 |
| path | string | URL of the .csv, .tsv, or .json file. The maximum file size you can upload from a web location is 25 MB. | 2.0 |
| type | string | Type of dataset data. Valid values are text-intent and text-sentiment. | 2.0 |
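
As a rough illustration of how the request parameters fit together, the following Python sketch assembles the form fields for a request that points at a file URL. The helper name build_dataset_fields is hypothetical and not part of the API. The sketch only builds a dictionary and sends nothing; the endpoint, authentication, and HTTP client are left out on purpose.

```python
VALID_TYPES = {"text-intent", "text-sentiment"}


def build_dataset_fields(path, dataset_type, name=None, language=None):
    """Assemble form fields for creating a dataset from a file URL.

    Hypothetical helper: it builds a dict and performs no HTTP call.
    """
    if dataset_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {sorted(VALID_TYPES)}")
    fields = {"path": path, "type": dataset_type}
    # name and language are optional; leaving them out relies on the
    # documented defaults (file-derived name, en_US).
    if name is not None:
        fields["name"] = name
    if language is not None:
        fields["language"] = language
    return fields
```

Passing only a URL and a type yields a dictionary with just the path and type keys, which relies on the documented defaults for name and language.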

The API call is asynchronous, so you receive a dataset ID back immediately, but the available value is false and the statusMsg value is UPLOADING. To find out when the upload is complete, call Get a Dataset with the dataset ID. When available is true and statusMsg is SUCCEEDED, the data upload is complete, and you can train the dataset to create a model.
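
The polling described above could be sketched in Python as follows. The function name wait_for_dataset, the get_dataset callable, and the interval and timeout defaults are all assumptions made for illustration. get_dataset stands in for whatever code you use to call Get a Dataset and return the parsed response as a dictionary.

```python
import time


def wait_for_dataset(get_dataset, dataset_id, interval=5.0, timeout=600.0,
                     sleep=time.sleep):
    """Poll until the dataset upload finishes, fails, or times out.

    get_dataset is a caller-supplied (hypothetical) callable that returns
    the Get a Dataset response as a dict.
    """
    deadline = time.monotonic() + timeout
    while True:
        dataset = get_dataset(dataset_id)
        status = dataset.get("statusMsg", "")
        # Upload is complete only when both conditions hold.
        if dataset.get("available") and status == "SUCCEEDED":
            return dataset
        # A failed upload reports a statusMsg that starts with FAILED.
        if status.startswith("FAILED"):
            raise RuntimeError(f"dataset {dataset_id} upload failed: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"dataset {dataset_id} still not available")
        sleep(interval)
```

The sleep parameter exists only so the loop can be exercised without real delays.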

Keep the following points in mind when creating datasets.

FILE SIZE

  • The maximum file size you can upload from a local drive or web location is 25 MB.

  • If your file is more than 20 MB, for better performance, we recommend that you upload it to a cloud location that doesn't require authentication and pass the URL in the path parameter.

  • If you have a large amount of data, consider breaking up your data into multiple files. You can load the first file using this call and then load subsequent files using PUT. See Create Examples From a File.
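
One way to break a large data set into several files is to group lines by their encoded size. The Python sketch below is illustrative only: the function name split_lines_by_size is hypothetical, and reading 25 MB as 25,000,000 bytes is a conservative assumption, since the exact byte count of the limit is not stated here.

```python
# Assumption: 25 MB is treated as 25,000,000 bytes (the conservative reading).
MAX_UPLOAD_BYTES = 25_000_000


def split_lines_by_size(lines, max_bytes=MAX_UPLOAD_BYTES):
    """Group text lines into chunks whose UTF-8 size stays within max_bytes."""
    chunks, current, current_size = [], [], 0
    for line in lines:
        size = len(line.encode("utf-8"))
        if size > max_bytes:
            raise ValueError("a single line exceeds the size limit")
        # Start a new chunk when adding this line would exceed the limit.
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can be written to its own file. The first file would go through this call and the rest through Create Examples From a File.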

DATASETS

  • The maximum total dataset size is 2 GB.

LABELS

  • The maximum label name length is 180 characters. If a file contains a class label name greater than 180 characters, the label is created in the dataset, but the API truncates the label name to 180 characters.

  • You must have at least two labels in the dataset. If you don't have at least two labels, you can create the dataset but training the dataset fails.

  • A dataset can have up to 500 labels. However, we recommend that a dataset has a maximum of 100 labels for better model accuracy.
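
The label limits above can be checked before uploading. This Python sketch is a local pre-check under the stated limits, not something the API provides; the function name check_labels and the wording of the messages are hypothetical.

```python
MAX_LABEL_LENGTH = 180
MIN_LABELS = 2
MAX_LABELS = 500
RECOMMENDED_MAX_LABELS = 100


def check_labels(labels):
    """Return a list of problems found in a collection of label names."""
    unique = set(labels)
    problems = []
    if len(unique) < MIN_LABELS:
        problems.append("fewer than 2 labels: training will fail")
    if len(unique) > MAX_LABELS:
        problems.append("more than 500 labels")
    elif len(unique) > RECOMMENDED_MAX_LABELS:
        problems.append("more than 100 labels: accuracy may suffer")
    for label in sorted(unique):
        if len(label) > MAX_LABEL_LENGTH:
            # The API keeps such labels but truncates them to 180 characters.
            problems.append(f"label will be truncated: {label[:20]}...")
    return problems
```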

EXAMPLES

  • If a file contains duplicate intent or sentiment strings, only the first one is loaded.

  • If the dataset type is text-intent, each label must have at least 20 examples. We recommend that each label has at least 100 examples for better model accuracy. If you have only 20 examples, be sure to use the multilingual-intent or multilingual-intent-ood algorithm when you train the dataset.

  • If the dataset type is text-sentiment, each label must have at least 100 examples.

  • We recommend a maximum of 10,000 examples across all labels.

  • A dataset can have a maximum of 3 million words across all examples. If you try to train a dataset that has more than 3 million words, you receive an error.

  • For best results, we recommend that each example is less than 150 words.

  • If the file contains duplicate text strings (examples), only the first string is uploaded and the others are skipped. This is true whether the duplicate text strings have the same label or different labels.
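
The example rules above lend themselves to a local pre-check. The following Python sketch drops duplicate texts (keeping the first), counts examples per label, and totals the words. The function name summarize_examples is hypothetical. Treating duplicates as exact string matches and counting words by splitting on whitespace are both assumptions; the API may compare and count differently.

```python
from collections import Counter

MIN_EXAMPLES = {"text-intent": 20, "text-sentiment": 100}
MAX_TOTAL_WORDS = 3_000_000


def summarize_examples(rows, dataset_type):
    """rows is an iterable of (text, label) pairs.

    Returns (kept_rows, problems).
    """
    seen, kept = set(), []
    for text, label in rows:
        # Only the first occurrence of a text is kept, whatever its label.
        if text in seen:
            continue
        seen.add(text)
        kept.append((text, label))

    problems = []
    per_label = Counter(label for _, label in kept)
    minimum = MIN_EXAMPLES[dataset_type]
    for label, count in sorted(per_label.items()):
        if count < minimum:
            problems.append(f"{label}: {count} examples, need at least {minimum}")

    # Assumption: words are counted by splitting on whitespace.
    total_words = sum(len(text.split()) for text, _ in kept)
    if total_words > MAX_TOTAL_WORDS:
        problems.append(f"{total_words} words exceeds the 3 million word limit")
    return kept, problems
```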

OTHER

  • The Einstein Language APIs support only UTF-8 text characters. If your examples or labels contain any non-UTF-8 text, you receive an error that the file format is invalid when you try to create the dataset.

  • When specifying the URL for a file in a cloud drive service like Dropbox, be sure it's a link to the file and not a link to the interactive download page. For example, the URL should look like https://www.dropbox.com/s/abcdxyz/weather.csv?dl=1

  • If you create a dataset in Apex code, be sure that you reference the URL to the file with https and not http.
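
Two of the points above, UTF-8 content and https URLs, can be checked locally before you make the call. The helper names in this Python sketch are hypothetical; it uses only the standard library.

```python
from urllib.parse import urlparse


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the raw file bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def is_https_url(url: str) -> bool:
    """Return True if the URL uses the https scheme."""
    return urlparse(url).scheme == "https"
```

Reading the file in binary mode and passing the bytes to is_valid_utf8 avoids any implicit decoding by the platform's default encoding.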

Response Body

| Name | Type | Description | Available Version |
|---|---|---|---|
| available | boolean | Specifies whether the dataset is ready to be trained. | 2.0 |
| createdAt | date | Date and time that the dataset was created. | 2.0 |
| id | long | Dataset ID. | 2.0 |
| labelSummary | object | Contains the labels array that contains all the labels for the dataset. This is an asynchronous call, so the labels array is empty when you first create a dataset. | 2.0 |
| language | string | Dataset language. Default is en_US. | 2.0 |
| name | string | Name of the dataset. If the name parameter was omitted from the request, the API uses the name of the file for the dataset name. | 2.0 |
| numOfDuplicates | int | Number of duplicate text strings in the file from which the dataset was created. | 2.0 |
| object | string | Object returned; in this case, dataset. | 2.0 |
| statusMsg | string | Status of the dataset creation and data upload. Valid values are FAILED: <message> (data upload has failed), SUCCEEDED (data upload is complete), and UPLOADING (data upload is in progress). | 2.0 |
| totalExamples | int | Total number of examples in the dataset. | 2.0 |
| type | string | Type of dataset data. Valid values are text-intent and text-sentiment. | 2.0 |
| updatedAt | date | Date and time that the dataset was last updated. | 2.0 |
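
Because a failed upload reports its reason inside statusMsg, it can help to split the value into a state and a detail. This Python sketch assumes the FAILED: <message> shape described for statusMsg; the function name parse_status is hypothetical.

```python
def parse_status(status_msg):
    """Split a statusMsg value into (state, detail).

    detail is None when the value carries no message.
    """
    state, _, detail = status_msg.partition(":")
    return state.strip(), detail.strip() or None
```

For SUCCEEDED and UPLOADING the detail is None, and for a failure it holds the text after the colon.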

Each dataset type supports different file formats. This table lists the file formats supported by each dataset type.

| File format | text-intent | text-sentiment |
|---|---|---|
| .csv file | Y | Y |
| .tsv file | Y | Y |
| .json file | Y | N |
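
The support matrix above can be encoded as a small lookup. The names in this Python sketch are hypothetical, and lowercasing the extension before comparing is an assumption about how lenient you want the local check to be.

```python
import os

SUPPORTED_FORMATS = {
    "text-intent": {".csv", ".tsv", ".json"},
    "text-sentiment": {".csv", ".tsv"},
}


def is_supported(filename, dataset_type):
    """Return True if the file extension is allowed for the dataset type."""
    extension = os.path.splitext(filename)[1].lower()
    return extension in SUPPORTED_FORMATS.get(dataset_type, set())
```

Pass a plain file name rather than a URL with a query string, because the query string would be read as part of the extension.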

CSV File Format Considerations

  • You can use a .csv file to create both intent and sentiment datasets.

  • When you create a dataset from a .csv file, the dataset name is inherited from the file name.

  • Each .csv file contains sentiment or intent data in this format: "sentiment or intent string", label-name followed by a CRLF. In the following example, the intent string is in double quotes followed by the label current-weather.

"what's the weather look like",current-weather
"is it raining",current-weather
"what's the temperature",current-weather

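A file in the .csv layout shown above could be produced with a few lines of Python. The function name format_csv_rows is hypothetical. Doubling embedded double quotes follows the usual CSV convention and is an assumption, since how embedded quotes should be escaped is not stated here.

```python
def format_csv_rows(rows):
    """Render (text, label) pairs as '"text",label' lines ending in CRLF."""
    lines = []
    for text, label in rows:
        # Assumption: embedded double quotes are escaped by doubling them.
        escaped = text.replace('"', '""')
        lines.append(f'"{escaped}",{label}\r\n')
    return "".join(lines)
```

When writing the result to disk, open the file with encoding="utf-8" and newline="" so that Python does not alter the CRLF line endings.
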
TSV File Format Considerations

  • You can use a .tsv file to create both intent and sentiment datasets.

  • When you create a dataset from a .tsv file, the dataset name is inherited from the file name.

  • Each .tsv file contains sentiment or intent data in this format: "sentiment or intent string"<tab_char>label-name followed by a CRLF. In the following example, the intent string is in double quotes followed by the label current-weather.

"what's the weather look like"  current-weather
"is it raining" current-weather
"what's the temperature"    current-weather

JSON File Format Considerations

  • You can use a .json file to create only an intent dataset.

  • The top-level object must be called "intents" as shown in the example file.

  • When you create a dataset from a .json file, the dataset name is inherited from the file name.

  • The JSON structure is a top-level object that contains multiple arrays, one per label. Each array contains the intent strings in double quotes. In the example file, the labels are current-weather, five-day-forecast, and hourly-forecast. Each string in double quotes becomes a single example that has the specified label.
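
The JSON layout described above could be generated as in this Python sketch. It assumes that each array sits under a key named after its label inside the "intents" object, which is one reading of the description and should be checked against the example file. The function name build_intents_json is hypothetical.

```python
import json


def build_intents_json(examples_by_label):
    """Serialize {label: [example, ...]} under a top-level "intents" object.

    Assumption: each array is keyed by its label name.
    """
    return json.dumps({"intents": examples_by_label}, indent=2, ensure_ascii=False)
```

Calling it with a dictionary such as {"current-weather": ["is it raining"]} returns text that can be saved as a UTF-8 .json file.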

Language