Get information about training data, use cases, features, and other considerations for the optical character recognition (OCR) model.
Salesforce AI Research
- February 2021
- The model doesn’t have numbered versioning. When a fix is sent out, it's deployed everywhere so customers are always running on the latest version.
- Minor changes can occur throughout the release.
- Major changes can occur and are communicated via release notes.
Vision model for character recognition.
Einstein OCR has three modules. The first module detects the text in an image. The second module recognizes the characters in the text that the first module detected and converts them, character by character, to a string. The third module tags this computed text with an entity such as person, phone number, or address.
The modules use a convolutional neural network (CNN) plus a combination of a CNN and long short-term memory (LSTM) network.
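The three-module flow described above can be sketched as a small pipeline. This is only an illustration of how the stages hand data to each other. The function names, the `Recognized` type, the pre-annotated "toy image", and the rule-based entity tagger are all hypothetical stand-ins, not the Einstein OCR implementation, which uses CNN and LSTM networks.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]          # (x_min, y_min, x_max, y_max)
ToyImage = List[Tuple[Box, str]]         # hypothetical stand-in for pixel data


@dataclass
class Recognized:
    box: Box
    text: str
    entity: Optional[str] = None


def detect_text(image: ToyImage) -> List[Box]:
    """Module 1 stand-in: report where text is located in the image."""
    return [box for box, _ in image]


def recognize_characters(image: ToyImage, box: Box) -> str:
    """Module 2 stand-in: build the string one character at a time."""
    source = dict(image)[box]
    chars = []
    for ch in source:
        chars.append(ch)
    return "".join(chars)


def tag_entity(text: str) -> Optional[str]:
    """Module 3 stand-in: a crude rule in place of a trained NER model."""
    if re.fullmatch(r"[+\d][\d\s().-]{6,}", text):
        return "phone"
    return None


def run_pipeline(image: ToyImage) -> List[Recognized]:
    """Chain the three stages: detect, recognize, then tag."""
    results = []
    for box in detect_text(image):
        text = recognize_characters(image, box)
        results.append(Recognized(box=box, text=text, entity=tag_entity(text)))
    return results


card = [((0, 0, 50, 10), "Jane Doe"), ((0, 20, 80, 30), "+1 415-555-0100")]
for item in run_pipeline(card):
    print(item)
```

The point of the sketch is the ordering: entity tagging sees only the string produced by the second stage, so any character-level error carries through to the entity stage.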
Einstein OCR is available to Salesforce customers that have any SKU for Einstein Vision and Language.
The Einstein OCR model detects alphanumeric text in an image.
The Einstein OCR model is designed for enterprise customers. It's accessed via REST APIs or embedded in a Salesforce application.
- OCR text: Identifies text in an image and returns the relative position of that text.
- Use cases include VIN detection and serial number detection.
- OCR entity: Identifies text in an image and returns the relative position and the associated entity. To build on OCR text, it identifies text and then identifies the entity that the text represents.
- Use cases include extracting contact information from business cards.
- OCR table: Identifies text in an image and returns the relative position and associated cell number. To build on OCR text, it identifies tables in the text and maps entries to a specific table cell.
- Use cases include digitizing tables such as price sheets.
- The Einstein OCR model currently only supports English. See the Caveats section for details on supported characters.
- The model doesn't support handwritten text, checkboxes, or circled answers.
- The model is optimized for short-form prediction rather than the identification of characters, words, or entities in long-form sentences or paragraphs.
- Certain use cases are prohibited under the Salesforce Acceptable Use Policy.
- A user can't submit data prohibited by the Security, Privacy and Architecture documentation, Sensitive Data section.
The quality of the input can affect the accuracy of the results. These factors can affect text recognition accuracy.
- Scanned image
- Unknown source (internet)
- Low light
- Low resolution
- High skew or perspective
- Varying density of words in the document
- Length of strings
- Font of the text, handwritten text
These factors can affect entity recognition accuracy.
- Addresses in varying formats, for example, differences among countries, apartment numbers, PO boxes
- Birth dates in different formats
- Names from different cultures
- Various genders
Precision and recall are used to evaluate where the text is within an image (the bounding box). Word-level accuracy is used to evaluate character recognition in identified text. The entity F1 score is used to discern the accuracy of entity tags identified for text.
The overall model is also evaluated using an F1 score, calculated based on the precision and recall and the word-level accuracy mentioned above.
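As a rough illustration of the individual metrics named above, the following sketch computes bounding-box precision, recall, and F1 using an intersection-over-union (IoU) match, plus exact-match word-level accuracy. The 0.5 IoU threshold, the greedy matching, and all function names are assumptions made for the example. How Einstein OCR combines these values into its overall F1 score isn't specified here, so that step is not shown.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_scores(predicted: Sequence[Box], truth: Sequence[Box],
                     threshold: float = 0.5) -> Tuple[float, float, float]:
    """Greedy one-to-one matching; returns (precision, recall, f1)."""
    unmatched = list(truth)
    true_positives = 0
    for p in predicted:
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= threshold:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def word_accuracy(predicted_words: List[str], true_words: List[str]) -> float:
    """Fraction of ground-truth words transcribed exactly (case-sensitive)."""
    if not true_words:
        return 0.0
    correct = sum(p == t for p, t in zip(predicted_words, true_words))
    return correct / len(true_words)


pred = [(0, 0, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
gold = [(0, 0, 10, 10), (21, 21, 31, 31)]
print(detection_scores(pred, gold))
print(word_accuracy(["EXIT", "0PEN"], ["EXIT", "OPEN"]))
```

In this toy input, two of the three predicted boxes match a ground-truth box, and one of the two words is transcribed exactly.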
The training dataset is composed of data from these open-source libraries:
- Born-Digital Images, found on https://rrc.cvc.uab.es, in the Challenges picklist
- Focused Scene Text, found on https://rrc.cvc.uab.es, in the Challenges picklist
- MLT, found on https://rrc.cvc.uab.es, in the Challenges picklist
- COCO-Text, found on https://rrc.cvc.uab.es, in the Challenges picklist
In addition to the open-source libraries, synthetic datasets were used. Synthetic datasets were created by picking a scene or plain white background and then imposing randomly selected words onto the background and changing certain attributes such as font, character size, or skew.
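The synthetic-data recipe above can be sketched as a generator of sample specifications. This minimal example only chooses the attributes for each sample (background, word, font, character size, skew) and leaves out the image rendering. The word list, background names, font names, and value ranges are all hypothetical and are not the ones used to train the model.

```python
import random
from dataclasses import dataclass
from typing import List

# Hypothetical pools to draw from; the real sources are not documented here.
BACKGROUNDS = ["plain_white", "street_scene", "storefront"]
WORDS = ["invoice", "total", "exit", "A1B2C3", "open", "2021"]
FONTS = ["serif", "sans", "mono"]


@dataclass
class SyntheticSample:
    background: str
    word: str
    font: str
    font_size: int        # character size in points
    skew_degrees: float   # rotation applied to the word


def make_samples(count: int, seed: int = 0) -> List[SyntheticSample]:
    """Pick a background and a random word, then vary font, size, and skew."""
    rng = random.Random(seed)  # seeded so that a dataset can be regenerated
    return [
        SyntheticSample(
            background=rng.choice(BACKGROUNDS),
            word=rng.choice(WORDS),
            font=rng.choice(FONTS),
            font_size=rng.randint(12, 48),
            skew_degrees=rng.uniform(-30.0, 30.0),
        )
        for _ in range(count)
    ]


for sample in make_samples(3):
    print(sample)
```

Each specification would then be passed to an image renderer, and the known word and placement serve as the ground-truth label for that image.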
The entity recognition capabilities use a combination of in-house named-entity recognition (NER) models, the Stanford business card dataset, and an in-house business card dataset.
The data used to evaluate the model was gathered from the sources listed above. The evaluation and training data sets are different subsets of the source data that don’t overlap.
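One simple way to obtain training and evaluation subsets that don't overlap is to shuffle the source items once and cut the list in two. The sketch below shows that idea. The 20% evaluation fraction and the function name are assumptions for illustration, and the example assumes each source item appears only once.

```python
import random
from typing import Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def split_disjoint(items: Iterable[T], eval_fraction: float = 0.2,
                   seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle once, then cut: every item lands in exactly one subset."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * eval_fraction)
    return shuffled[cut:], shuffled[:cut]   # (training, evaluation)


train, evaluation = split_disjoint(range(100))
print(len(train), len(evaluation))
print(set(train).isdisjoint(evaluation))
```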
The results reported here are from a publicly available challenge dataset, ICDAR Focused scene text. This dataset consists of real-world scenes such as storefronts and street signs, with images primarily focused on text content. This scenario is typical for text reading applications where the user explicitly directs the camera’s focus on the text content of interest. The challenge in this dataset arises from complex text orientation and complex backgrounds.
Complex text orientation necessitates estimating skew in addition to the location and content of the text. Complex backgrounds can reduce contrast of the text and introduce false positives in the form of random patterns that appear like text.
The ICDAR dataset was designed to encourage the research community to develop solutions for text reading in real-world images, and the challenge allowed the OCR system to limit the words to a fixed vocabulary (as with the State of the Art (SoTA) benchmark reported below). Such a limitation can help correct OCR errors because the words to be detected in the test set are present in the fixed vocabulary. As a result, the search space for a word in a test image decreases from potentially infinite (all possible combinations of letters and digits) to the small number of words in the specified vocabulary.
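To make the effect of a fixed vocabulary concrete, the sketch below snaps a raw OCR string to the closest vocabulary word by edit distance. It's an illustrative technique only, not the method used by the benchmark entries or by Einstein OCR, and the function names and sample vocabulary are hypothetical.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def snap_to_vocabulary(raw: str, vocabulary: Sequence[str]) -> str:
    """Return the vocabulary word closest to the raw OCR output."""
    return min(vocabulary, key=lambda word: edit_distance(raw, word))


vocabulary = ["EXIT", "OPEN", "SALE"]
print(snap_to_vocabulary("EX1T", vocabulary))
```

Here a single misread character is repaired because the intended word is in the vocabulary. The same step would force a string that isn't in the vocabulary, such as a serial number, onto an unrelated word.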
The SoTA models reported here use a general vocabulary set to aid in text prediction, and are trained on this dataset. But our results don’t use a vocabulary because the models support OCR applications that can’t be captured by a vocabulary, such as vehicle identification number (VIN) or serial number scanning. The models aren't trained on this dataset exclusively. A generic model is used that’s more robust and addresses a wider range of challenges. The F1 score diverges from the SoTA results by just 6.17%.
Evaluation Metric: F1 score
Salesforce Research Model ID:
Fine tuned for the dataset
The text is processed into a string character by character. Because dictionaries aren’t referenced, underlying bias is not an issue. But the supported characters are limited to the Roman alphabet, which could be advantageous to English-speaking individuals, companies, and organizations.
Entities are identified only after the text is converted character by character to a string, so the identification of certain entities can be challenging for the model. For example, first or last names from certain regional, ethnic, or religious groups are sometimes recognized with higher accuracy than names from other groups. This issue can also affect entities that involve non-binary gender or varied birth date formats.
- Only text that contains characters found in the supported character list is identified. Languages with characters other than those in the supported character list aren’t supported. An increase in the supported characters and languages is planned.
- Only entities in the supported entity list are identified in this version of the model.
- Only phone numbers displayed in one of the supported phone number formats are identified in this version of the model.
Phone numbers with these country-specific formats are supported.