D4.2 - Automating data capture from natural history specimens

Summary from the Description of Work (17 Jan 2013)
Executive Summary
Introduction
Section 1: Review of development of tools and workflows which incorporate automatic or semiautomatic metadata capture using OCR
- Introduction
- Trial 1: Comparing a range of OCR software tools
  - Materials and Methods
  - Results
  - Discussion
- Trial 2: Comparing OCR tools being used in herbaria
  - Materials and Methods
  - Results
  - Discussion
- Trial 3: Multiple OCR trials of diverse specimens
  - Materials and Methods
  - Results
  - Workflows: Incorporating OCR into digitisation workflows
  - Discussion
Section 2: Review of development of NLP for parsing OCR text into Darwin core fields
- Introduction
- Review
Section 3: Review of (semi) automatic specimen image classification, i.e. (semi) automatic tagging of specimen images from certain collectors or expeditions, using template matching software
- Part 1: Semi-automated Classification of Herbarium Specimens by means of Template Matching Algorithms
- Part 2: Review and trials of Handwritten Text Recognition (HTR)
  - Introduction
  - Materials and Methods
  - Results
  - Discussion
Section 4: Review of automatic capture of character including colour, shape as well as EXIF data
- Part 1: Computer vision for specimen classification
  - Summary
  - Tools Used
  - Software Prototypes
  - Specimen segmentation
  - Method
  - Morphological feature detection
  - Calculating physical dimensions
  - Colour analysis
  - Heat maps for regions of interest
  - Dissemination
  - Links
  - References
- Part 2: Correlation of leaf colour and DNA quality
  - Introduction
  - Materials and Methods
  - Results
- References
  - Software and Projects
Appendix 1A: Settings for ABBYY Recognition Server v3 at RBGE
Appendix 1B: Trial 2 - Summary of OCR output for one specimen from each institute
Appendix 1C: Settings for ABBYY FineReader v12 Professional at RBGK
Appendix 1D: File preparation at RBGK
Appendix 1E: Scores for each specimen from each institute by word
Appendix 1F: OCR Software Results from RBGK testing of different formatting options
Appendix 2: Screenshots of portals using
Appendix 3: Protocol for using Transkribus for natural history collections
- Introduction
  - Step 1: Register and download software
  - Step 2: Log in
  - Step 3: Upload documents to your private collection
  - Step 4: Segment your document into text blocks and baselines
  - Step 5: Manually transcribe a training dataset of 100 pages
  - Step 6: Training the HTR model
  - Step 7: Running the HTR model
Appendix 4: Protocols for sampling and extracting DNA from herbarium specimens at RBGE
- DNA Extraction Methodology: using the QIAGEN automated QIAxtractor

Project element:

D4.2

Report file(s):

Automating data capture from natural history specimens D4.2_0.pdf

Author(s):

Elspeth Haston
