D4.2 - Automating data capture from natural history specimens

  • Summary from the Description of Work (17 Jan 2013)
  • Executive Summary
  • Introduction
  • Section 1: Review of development of tools and workflows which incorporate automatic or semiautomatic metadata capture using OCR
    • Introduction
    • Trial 1: Comparing a range of OCR software tools
      • Materials and Methods
      • Results
      • Discussion
    • Trial 2: Comparing OCR tools being used in herbaria
      • Materials and Methods
      • Results
      • Discussion
    • Trial 3: Multiple OCR trials of diverse specimens
      • Materials and Methods
      • Results
      • Workflows: Incorporating OCR into digitisation workflows
      • Discussion
  • Section 2: Review of development of NLP for parsing OCR text into Darwin Core fields
    • Introduction
    • Review
  • Section 3: Review of (semi) automatic specimen image classification, i.e. (semi) automatic tagging of specimen images from certain collectors or expeditions, using template matching software
    • Part 1: Semi-automated Classification of Herbarium Specimens by means of Template Matching Algorithms
    • Part 2: Review and trials of Handwritten Text Recognition (HTR)
      • Introduction
      • Materials and Methods
      • Results
      • Discussion
  • Section 4: Review of automatic capture of character including colour, shape as well as EXIF data
    • Part 1: Computer vision for specimen classification
      • Summary
      • Tools Used
      • Software Prototypes
      • Specimen segmentation
      • Method
      • Morphological feature detection
      • Calculating physical dimensions
      • Colour analysis
      • Heat maps for regions of interest
      • Dissemination
      • Links
      • References
    • Part 2: Correlation of leaf colour and DNA quality
      • Introduction
      • Materials and Methods
      • Results
    • References
      • Software and Projects
  • Appendix 1A: Settings for ABBYY Recognition Server v3 at RBGE
  • Appendix 1B: Trial 2 - Summary of OCR output for one specimen from each institute
  • Appendix 1C: Settings for ABBYY FineReader v12 Professional at RBGK
  • Appendix 1D: File preparation at RBGK
  • Appendix 1E: Scores for each specimen from each institute by word
  • Appendix 1F: OCR Software Results from RBGK testing of different formatting options
  • Appendix 2: Screenshots of portals using
  • Appendix 3: Protocol for using Transkribus for natural history collections
    • Introduction
      • Step 1: Register and download software
      • Step 2: Log in
      • Step 3: Upload documents to your private collection
      • Step 4: Segment your document into text blocks and baselines
      • Step 5: Manually transcribe a training dataset of 100 pages
      • Step 6: Training the HTR model
      • Step 7: Running the HTR model
  • Appendix 4: Protocols for sampling and extracting DNA from herbarium specimens at RBGE
    • DNA Extraction Methodology: using the QIAGEN automated QIAxtractor