In one of our previous blog posts, we talked about how we use a convolutional neural network for invoice recognition. At IxorThink, we have kept improving this model by, among other things, adding recognition fields, pattern recognition for addresses, and classification of customer and supplier information.
We recently updated our recognition API so that clients can send feedback. This makes it possible to track whether fields were changed by the user after detection (most probably because the user corrected a recognition error). Using this feedback to “feed back” into the model may seem straightforward, but it needs to be handled with care:
First of all, correctly labeling documents for training takes a lot of time and resources. We simply cannot use all this feedback data for training because of its high volume. Secondly, the training dataset needs to stay curated and balanced: if one invoice template has considerably more examples in the dataset than the others, the trained model will be skewed and more likely to overfit. The solution is to carefully select which invoices are needed for training. Preferably, we select documents whose template is not yet contained in our dataset. This is the best way to keep our dataset balanced.
Determining whether a document has lookalikes in our dataset is not that easy: documents can be scanned or rotated, and they can be written in different languages. To compute the layout similarity of two invoice documents, we introduced invoice embeddings.
An embedding is a low-dimensional space into which you translate complex data points. This makes the data easier to use as input for machine learning, and it can capture underlying semantics. In most cases, an embedding places similar inputs close to each other in the embedding space. This concept is used a lot in NLP, e.g. to map words to a vector space (Word2Vec). In our case, we want to map invoice documents to a low-dimensional embedding space in which invoices from the same supplier/template are close together. This means we can use a distance metric (like cosine or Euclidean distance) to measure whether two invoices are alike.
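To make the two distance metrics mentioned above concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and their names are made up for illustration; real invoice embeddings would come from the trained model and would have many more dimensions.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical embeddings: a and b share a template, c is a different one.
invoice_a = [0.9, 0.1, 0.0]
invoice_b = [0.8, 0.2, 0.1]
invoice_c = [0.0, 0.1, 0.9]

# With a good embedding, same-template invoices are closer than others.
print(euclidean_distance(invoice_a, invoice_b) < euclidean_distance(invoice_a, invoice_c))
print(cosine_distance(invoice_a, invoice_b) < cosine_distance(invoice_a, invoice_c))
```

Cosine distance ignores the length of the vectors and only compares their direction, while Euclidean distance is sensitive to both; which one works better depends on how the embedding model was trained.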
To learn the mapping from invoice to embedding space, we use triplet learning. The main idea is to train the neural network with groups of three samples: one anchor, one positive and one negative. The anchor and the positive sample are invoices of the same template and we train the network in a way that they are close to each other in the embedding space. At the same time, we make sure that the anchor and the negative are far away from each other. You can find more about this technique in one of our previous blogs.
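The objective behind this idea is often written as a margin-based triplet loss. The sketch below shows that common formulation in plain Python; the margin value, the use of Euclidean distance, and the example vectors are assumptions for illustration, not necessarily the exact loss used in our model.

```python
import math


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Common margin-based triplet loss on three embedding vectors.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is. The margin of 0.2 is an
    assumed value chosen only for this example.
    """
    d_pos = math.dist(anchor, positive)  # should become small
    d_neg = math.dist(anchor, negative)  # should become large
    return max(d_pos - d_neg + margin, 0.0)


# Hypothetical 2-D embeddings.
anchor = [0.0, 0.0]
same_template = [0.0, 0.1]
other_template = [1.0, 1.0]

# Well-separated triplet: nothing left to learn, loss is zero.
print(triplet_loss(anchor, same_template, other_template))

# Swapped roles: the "positive" is far away, so the loss is positive.
print(triplet_loss(anchor, other_template, same_template))
```

During training, the network's weights are updated to reduce this loss over many triplets, which pulls same-template invoices together and pushes different templates apart.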
Currently, the embedding model is up and running in an AWS Lambda function. It returns a hash for every feedback document sent back by our clients. By calculating the distance to the files in our dataset, we determine whether a template is already in our dataset and how many times it occurs. The image below is a screenshot from our feedback system interface. The “Tag Count” column shows our embedding-based distance measurement in practice.
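As a rough illustration of how such a count could be derived, the sketch below counts how many dataset embeddings lie within a distance threshold of a new feedback document. The function names, the threshold, and the sample vectors are hypothetical; this is not the code running in our Lambda function.

```python
import math


def count_lookalikes(new_embedding, dataset_embeddings, threshold=0.2):
    """Count dataset documents whose embedding is within `threshold`.

    The threshold is an assumed value; in practice it would be tuned
    on documents with known templates.
    """
    return sum(
        1
        for existing in dataset_embeddings
        if math.dist(new_embedding, existing) <= threshold
    )


def is_new_template(new_embedding, dataset_embeddings, threshold=0.2):
    """A document with no lookalikes is a good candidate for labeling."""
    return count_lookalikes(new_embedding, dataset_embeddings, threshold) == 0


# Hypothetical 2-D embeddings of documents already in the dataset.
dataset = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]]

print(count_lookalikes([0.05, 0.0], dataset))  # close to the first two
print(is_new_template([5.0, 5.0], dataset))    # far from everything
```

A linear scan like this is fine for a small dataset; for a large one, an approximate nearest-neighbour index would be the usual replacement.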