- Digitize / image processing
- Optical character recognition
- Machine learning / training
- Build data models
- Human data review
- Data integration
Some MTRs are emailed in PDF format as part of the shipment notification for ordered products. In many cases, however, only physical copies are available, so they must be scanned first. The first step is to take the digital copy of the mill test report and use image processing software to remove all non-text elements.
This means digitally removing all the lines and non-text objects. Humans need this kind of document structure to understand the data, but for software to “read” the MTR, these lines and objects only get in the way. And if the MTR has been faxed, dragged around a dirty shop floor, and then scanned, it’s going to need some digital cleanup!
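To make the idea concrete, here is a minimal sketch of line removal in Python. It assumes the scan has already been reduced to a grid of 0s and 1s (1 = ink) and treats any long, straight run of ink as a table line rather than text. The function name and the `min_run` setting are my own illustration, not how any particular product does it, and real image processing software handles skew, speckle, and broken lines that this toy version ignores.

```python
def remove_ruling_lines(grid, min_run):
    """Return a copy of a binary image (1 = ink) with long straight runs erased.

    Any run of at least `min_run` consecutive ink pixels along a row or a
    column is assumed to be a ruling line; shorter runs are kept as text.
    """
    height = len(grid)
    width = len(grid[0]) if grid else 0
    erase = [[False] * width for _ in range(height)]

    def mark_long_runs(cells):
        run = []
        for cell in cells + [None]:  # trailing None flushes the final run
            if cell is not None and grid[cell[0]][cell[1]] == 1:
                run.append(cell)
                continue
            if len(run) >= min_run:
                for r, c in run:
                    erase[r][c] = True
            run = []

    for r in range(height):
        mark_long_runs([(r, c) for c in range(width)])
    for c in range(width):
        mark_long_runs([(r, c) for r in range(height)])

    return [
        [0 if erase[r][c] else grid[r][c] for c in range(width)]
        for r in range(height)
    ]


# A tiny "page": a top border, a left border, and a few text-like pixels.
page = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 1, 1, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
]
cleaned = remove_ruling_lines(page, min_run=5)
```

In this example both borders are erased while the short strokes in the middle survive, which is the effect you want before handing the page to OCR.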
The second step is to perform optical character recognition (OCR) on the MTR to recognize all the text characters on the page. Modern OCR tools can run multiple OCR “engines” on the document until a desired level of accuracy is achieved. Even if 100% accuracy isn’t obtained, any misread characters will be caught and corrected in later steps.
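The "run several engines until you're satisfied" logic can be sketched in a few lines of Python. Everything here is an assumption for illustration: each engine is a stand-in function that returns the text plus a confidence score between 0 and 1, and the `target` value is arbitrary. Real OCR engines report confidence in their own ways.

```python
def ocr_until_confident(image, engines, target=0.98):
    """Try each engine in order and stop at the first one that meets `target`.

    If no engine reaches the target, the most confident result is returned
    so that later review steps can still work with it.
    """
    best = None
    for name, engine in engines:
        text, confidence = engine(image)
        if best is None or confidence > best["confidence"]:
            best = {"engine": name, "text": text, "confidence": confidence}
        if confidence >= target:
            break
    return best


# Hypothetical stand-ins for real OCR engines.
def fast_engine(image):
    return ("HEAT NO. 4471B", 0.91)


def careful_engine(image):
    return ("HEAT NO. 4471B", 0.99)


result = ocr_until_confident(
    b"scanned page bytes",
    [("fast", fast_engine), ("careful", careful_engine)],
)
fallback = ocr_until_confident(b"scanned page bytes", [("fast", fast_engine)])
```

Here `result` comes from the second engine because the first fell short of the target, while `fallback` shows what happens when nothing reaches it: you still get the best available reading, just with a lower confidence attached.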
The third step is to use a supervised machine learning algorithm to train the software to recognize that a document is a mill test report. This is important in case additional documents are attached to the MTR. You may or may not want to collect this additional data.
This training may sound daunting, but it really isn’t. Training is as simple as putting a test batch of MTRs into the system and telling it which pages are the MTR. Training might require a dozen examples, but not hundreds or thousands. Once the system learns what an MTR is, you can test it on a much larger set of documents to verify that it is classifying them correctly.
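If you're curious what "training on a handful of labeled pages" can look like, here is a small sketch using a naive Bayes word-count classifier in plain Python. I picked naive Bayes only because it is easy to show; it is not necessarily the algorithm any given platform uses, and the sample page text and labels are made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """examples: list of (page_text, label) pairs supplied by a person."""
    words = {}
    pages = Counter()
    for text, label in examples:
        pages[label] += 1
        words.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in words.values() for w in counts}
    return {"words": words, "pages": pages, "vocab": vocab}


def classify(model, text):
    """Return the label with the highest log-probability for this page."""
    total_pages = sum(model["pages"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label, counts in model["words"].items():
        total_words = sum(counts.values())
        score = math.log(model["pages"][label] / total_pages)
        for word in tokenize(text):
            # Add-one smoothing so unseen words don't zero out a label.
            score += math.log((counts[word] + 1) / (total_words + vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)


examples = [
    ("Mill Test Report heat number chemical analysis tensile yield", "mtr"),
    ("Certified Material Test Report heat no chemical composition mechanical properties", "mtr"),
    ("Mill certificate heat chemical analysis carbon manganese yield strength", "mtr"),
    ("Packing list quantity pieces weight pallet ship to", "other"),
    ("Invoice amount due payment terms remit to total", "other"),
    ("Bill of lading carrier freight consignee ship date", "other"),
]
model = train(examples)
page_a = classify(model, "Test report: heat 4471B, chemical analysis and yield strength")
page_b = classify(model, "Invoice total amount due, payment terms net 30")
```

With only six labeled pages the model already separates the test report from the invoice. Verifying it against a much larger batch, as described above, is what tells you whether that holds up on your real documents.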
After training has been completed, the fourth step is to build data collection models. Building these models requires an expert who has a deep understanding of what data is on a mill test report, what data needs to be collected, and the different ways information is referenced. For example, Heat #, Heat No., and Heat Number all mean the same thing.
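As a rough illustration of handling label variants, the sketch below uses one regular expression that accepts Heat #, Heat No., and Heat Number and captures the value that follows. The pattern and the assumption that heat numbers are made of letters, digits, and hyphens are mine; a real model would be tuned to the mills you actually buy from.

```python
import re

# Accepts "Heat #", "Heat No.", "Heat No", and "Heat Number", optionally
# followed by a colon or hyphen, then captures the value.
HEAT_LABEL = re.compile(
    r"\bheat\s*(?:#|no\b\.?|number\b)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-]*)",
    re.IGNORECASE,
)


def find_heat_number(text):
    """Return the heat number found in `text`, or None if no label matches."""
    match = HEAT_LABEL.search(text)
    return match.group(1).upper() if match else None


samples = [
    "Heat # 4471B",
    "HEAT NO. 88-1203",
    "Heat Number: a9921",
    "Heat treatment: normalized",
]
found = [find_heat_number(s) for s in samples]
```

The last sample is there on purpose: "Heat treatment" starts with the same word but is not a heat number label, so the model should return nothing for it.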
A data collection model must be built for every important piece of information you need to collect from the MTR. If you build all the models you think you need, and later discover that adding a new one would be useful, you can always build the model and re-process your MTRs to extract just that one new data element.
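Re-processing for a single new data element can be pictured like this. The sketch assumes you have kept the OCR text for each MTR and that each model is just a function from text to a value; the names `label_model`, `reprocess`, and the sample documents are hypothetical.

```python
import re


def label_model(label):
    """Build a simple extractor that returns the value following `label`."""
    pattern = re.compile(
        r"\b" + label + r"\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-]*)", re.IGNORECASE
    )

    def extract(text):
        match = pattern.search(text)
        return match.group(1) if match else None

    return extract


def reprocess(archive, records, models, only):
    """Run just the models named in `only` over stored text, updating records."""
    for doc_id, text in archive.items():
        record = records.setdefault(doc_id, {})
        for name in only:
            record[name] = models[name](text)
    return records


# Stored OCR text and the data already collected from it.
archive = {
    "mtr-001": "Heat No. 4471B  Grade: A36",
    "mtr-002": "Heat No. 88-1203  Grade: A572",
}
records = {
    "mtr-001": {"heat_number": "4471B"},
    "mtr-002": {"heat_number": "88-1203"},
}
models = {
    "heat_number": label_model(r"heat\s*no\."),
    "grade": label_model("grade"),  # the newly added model
}
reprocess(archive, records, models, only=["grade"])
```

Only the new `grade` value is extracted on this pass; the heat numbers collected earlier are left exactly as they were.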
The fifth step is one of the most important: human data review and correction. Remember I mentioned earlier that the OCR engines may not read every character with 100% accuracy? This is where that gets handled: the system is programmed to flag any word or number it isn’t 100% sure about. In a data review screen, the mill test report is displayed on screen so that what the system “read” can be compared with the actual document.
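The flagging rule itself is simple enough to sketch. This assumes each recognized word carries a confidence score where 1.0 means fully certain; the field names and sample values are invented for the example.

```python
def flag_uncertain(words, threshold=1.0):
    """Return (position, word) pairs whose confidence is below `threshold`."""
    return [(i, w) for i, w in enumerate(words) if w["confidence"] < threshold]


ocr_words = [
    {"text": "HEAT", "confidence": 1.0},
    {"text": "NO.", "confidence": 1.0},
    {"text": "4471B", "confidence": 0.72},
]
review_queue = flag_uncertain(ocr_words)
```

Keeping the position alongside each flagged word is what lets a review screen point the reviewer at the right spot on the page instead of making them hunt for it.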
A critical part of human review is to set up the system to automatically look up known information, like Purchase Order Number, Material Grade, and Order Requirements, in your database. Comparing those records with what the automated processing software found adds another layer of quality assurance. So, if your purchase order was for a particular material grade but the MTR lists something different, this will also be flagged for human review.
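Here is a minimal sketch of that lookup-and-compare step. It assumes the purchase order has already been fetched from your database as a dictionary, and the field names (`po_number`, `material_grade`) are placeholders for whatever your system uses.

```python
def normalize(value):
    """Ignore case and stray whitespace when comparing values."""
    return " ".join(str(value).split()).upper() if value is not None else ""


def compare_to_order(extracted, order, fields=("po_number", "material_grade")):
    """Return one mismatch entry per field where the MTR disagrees with the order."""
    mismatches = []
    for field in fields:
        if normalize(extracted.get(field)) != normalize(order.get(field)):
            mismatches.append(
                {
                    "field": field,
                    "expected": order.get(field),
                    "found": extracted.get(field),
                }
            )
    return mismatches


order = {"po_number": "PO-10482", "material_grade": "A36"}
extracted = {"po_number": "po-10482 ", "material_grade": "A572 Gr 50"}
mismatches = compare_to_order(extracted, order)
```

The purchase order number matches once case and spacing are ignored, so only the material grade ends up in the list for a person to check.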
The sixth and final step of the process is to integrate the MTR data and the digital copy of the document(s) with your existing quality software or reporting tool.
Interested in learning more about how to automate the process for reviewing mill test reports? It's only made possible by a new intelligent document processing platform called Grooper.