Software & Data

We use licenses that do not restrict non-commercial use wherever possible.

dprl GitLab repository -- dprl DockerHub repository

ChemScraper pipeline

(Ayush Kumar Shah, Bryan Amador, Abhisek Dey, Ming Creekmore, Brandon Kirincich et al., 2023-25) Tool for extracting and recognizing molecules in PDF files, generating CDXML (ChemDraw), SMILES, and SVG output. A number of tools used for the online MMLI ReactionMiner search demo are included in this repository.
GitLab repo: https://gitlab.com/dprl/graphics-extraction

UniChemFinder (Tool and Test Collection)
(Abhisek Dey, 2025) Multi-modal text + chemical diagram search tool with molecular diagram extraction and SMILES conversion. The test collection includes 130 drug patents.
GitLab repo: https://gitlab.com/dprl/unichemfinder

msg_debug: Message-Oriented Debugging Library for Python

(R. Zanibbi, 2023) Avoids the need to repeatedly add and remove print, input, and assert statements to check values and types, and provides functions to record and report execution times. Intended for situations where program requirements keep changing and bugs abound.
GitLab repo: https://gitlab.com/dprl/msg_debug

ARQMath: Answer Retrieval for Questions on Math

(B. Mansouri et al., 2022) Created over three years for CLEF 2020-2022, this collection contains over 200 topics (i.e., queries with evaluated search results) for both retrieving answers to questions posted in Math Stack Exchange, and retrieving relevant formulas using a formula from question posts as a query (i.e., contextual formula retrieval). In the third year of the task, there was also an open-domain question answering task. Search results from submitted systems were evaluated by undergraduate and graduate students from RIT.

The collection, topics, qrels (query relevance files), and evaluation tools are available from Google Drive (new: local backup copy available here).
ARQMath lab web pages
Task overview paper (CLEF 2022 CEUR Working Notes): pdf
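
As a rough illustration of how qrels are used to score a ranked result list, here is a minimal sketch. It assumes the common TREC-style qrels layout (topic, iteration, document id, relevance grade per line); the topic and document ids are hypothetical, and precision at k is shown only as a generic measure, not as the measure reported for the lab.

```python
def parse_qrels(lines):
    """Parse 'topic iteration doc_id relevance' lines into {topic: {doc_id: rel}}."""
    qrels = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _, doc_id, rel = parts
        qrels.setdefault(topic, {})[doc_id] = int(rel)
    return qrels


def precision_at_k(ranked_doc_ids, judgments, k, min_rel=1):
    """Fraction of the top-k results judged at least min_rel.

    Unjudged documents are counted as non-relevant (an assumption of this sketch).
    """
    top = ranked_doc_ids[:k]
    hits = sum(1 for d in top if judgments.get(d, 0) >= min_rel)
    return hits / k


# Hypothetical qrels lines and a hypothetical system ranking for one topic
qrels = parse_qrels(["A.1 0 101 2", "A.1 0 102 0", "A.1 0 103 1"])
ranking = ["101", "102", "999", "103"]
p_at_2 = precision_at_k(ranking, qrels["A.1"], k=2)
p_at_4 = precision_at_k(ranking, qrels["A.1"], k=4)
```

The evaluation tools distributed with the collection should be preferred for any reported results; this sketch only shows the shape of the data.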

NTCIR-12 MathIR

(R. Zanibbi et al., 2016) An earlier math-aware search collection created for the NTCIR conference. Two collections, one from Wikipedia and one from arXiv documents cut into packages, were used for a variety of tasks, including math formula search and keyword + math search.

NTCIR-12 Wikipedia Collection & Corpus description
Additional Data including Search Topics and Relevance Assessments (qrels) (Requires Registration with NTCIR)
NTCIR-12 MathIR web page
Task Overview Paper (NTCIR-12): pdf

CROHME Handwritten Math Formula Recognition Dataset

(H. Mouchère, M. Mahdavi et al., 2019) Collection of over 8000 handwritten training formulas along with multiple test sets (~1k formulas each, one for each year the CROHME competition was run for ICDAR or ICFHR). The data is 'online': formulas are represented as a series of handwritten 'strokes' (i.e., polylines representing the path of the stylus/touch surface as formulas were drawn). Ground truth is provided in both LaTeX and Presentation MathML. The final year of the competition also included a Typeset Formula Detection (TFD) sub-task.
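
To make the stroke representation concrete, the following minimal sketch treats each stroke as a list of (x, y) points and computes bounding boxes. The in-memory layout and the function names are assumptions made for illustration; the sketch does not read the collection's actual file format.

```python
def bounding_box(stroke):
    """Return (min_x, min_y, max_x, max_y) for one stroke (a polyline)."""
    xs = [x for x, _ in stroke]
    ys = [y for _, y in stroke]
    return (min(xs), min(ys), max(xs), max(ys))


def formula_bounding_box(strokes):
    """Return the box enclosing every stroke in a formula."""
    boxes = [bounding_box(s) for s in strokes]
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


# A hypothetical '+' drawn with two strokes: one horizontal, one vertical
plus_sign = [
    [(0, 5), (10, 5)],
    [(5, 0), (5, 10)],
]
horizontal_box = bounding_box(plus_sign[0])
formula_box = formula_bounding_box(plus_sign)
```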

A set of evaluation tools was created for the task over a number of years, and is included in the collection.

CROHME 2019 data set (includes data from all years of the competition + evaluation tools)
CROHME 2019 web page
Task overview paper (ICDAR 2019): pdf
Updated Label Graph evaluation tool (LgEval)
Updated CROHME data tools (CROHMELib)

MathDeck: Math-Aware Search Interface

(Gavin Nishizawa, Yancarlos Diaz, Jennifer Liu, Abishai Dmello, April 2020) MathDeck is a first-of-its-kind math-aware search interface with support for multi-modal formula editing, formula autocompletion, and search using a new 'chip and card' model for representing formulas and their descriptions/user annotations.

Online system
SIGIR 2023 Demonstration
Demonstration video (CHI 2021 Interactivity Demo)
Demonstration video (ECIR 2020 Demo Video)

PHOC: Spatial Retrieval (Pyramidal Histograms of Characters)

(M. Langsenkamp, B. Amador Manrique, 2022-2023)

Tools for creating, visualizing, and indexing PHOC representations for symbols in space. Used for dprl PHOC retrieval models at ARQMath-3 and later.

anyphoc: GitLab page
phocindexing: GitLab page
arqmath-compare: GitLab page -- helpful tool for visualizing relevance-rated formulas and search results for ARQMath formula retrieval tasks. Not PHOC-specific, but developed by Matt Langsenkamp while developing this work.
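
One simplified way to picture a spatial PHOC is sketched below. It rests on several assumptions that may not match the anyphoc tools: each symbol is reduced to a single x-coordinate normalized to [0, 1), only splits along the x-axis are used, and pyramid level k divides the axis into k equal regions. Each symbol label then gets a bit vector marking the regions where that symbol occurs.

```python
def phoc_bits(symbols, levels=3):
    """Map each symbol label to a bit list over pyramid regions.

    symbols: list of (label, x) pairs, with x normalized to [0, 1).
    Level k contributes k bits, so each vector has 1 + 2 + ... + levels bits.
    """
    size = levels * (levels + 1) // 2
    vectors = {}
    for label, x in symbols:
        bits = vectors.setdefault(label, [0] * size)
        offset = 0
        for k in range(1, levels + 1):
            region = min(int(x * k), k - 1)  # region of level k containing x
            bits[offset + region] = 1
            offset += k
    return vectors


# Hypothetical layout for "x + x": two 'x' symbols on either side of a '+'
vectors = phoc_bits([("x", 0.1), ("+", 0.5), ("x", 0.9)], levels=2)
```

With two levels, each vector has three bits: one for the whole span, then one each for the left and right halves. Vectors of this kind can be compared bit by bit, which is what makes them convenient for indexing.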

Tangent-CFT and MathFIRE: Embedding-Based Retrieval

(B. Mansouri and Matt Langsenkamp, June 2022) MathFIRE. An OpenSearch retrieval model implementation using faiss for fast retrieval of formulas embedded using TangentCFT. GitLab page

(B. Mansouri, Oct. 2019) TangentCFT. An embedding-based formula search engine. Tangent-CFT embeds representations of formula appearance and semantics in fixed-length vectors using fastText. Retrieval is performed using cosine similarity over the vectors. The system obtains very high coarse/partial similarity scores on the NTCIR-12 Formula Browsing Task, and when combined with Approach0 exceeds the state of the art (ICTIR 2019 paper). GitLab page -- **Old** GitHub page
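
The retrieval step described above, ranking by cosine similarity over fixed-length vectors, can be sketched in a few lines. The vectors below are toy values rather than Tangent-CFT embeddings, and the function names are hypothetical; this is not the system's code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen in this sketch for all-zero vectors
    return dot / (norm_u * norm_v)


def rank_formulas(query_vec, index):
    """Return (formula_id, score) pairs sorted by decreasing similarity."""
    scored = [(fid, cosine_similarity(query_vec, vec)) for fid, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional "embeddings" (hypothetical values)
index = {
    "f1": [1.0, 0.0, 0.0],
    "f2": [0.6, 0.8, 0.0],
    "f3": [0.0, 0.0, 1.0],
}
ranking = rank_formulas([1.0, 0.0, 0.0], index)
```

A linear scan like this is only practical for small collections; approximate nearest-neighbor libraries such as faiss, which MathFIRE uses, serve the same purpose at scale.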

Approach0 Formula Search Engine: Tree-Based Retrieval

(W. Zhong, Jan. 2019) A new formula search engine using paths in operator trees (representing operations in a formula), with support for multiple subexpression matches. Released as a companion to Wei's ECIR 2019 paper describing the system. The system obtains state-of-the-art results for queries without wildcards in the NTCIR-12 Formula Browsing Task.

The ECIR 2019 version of Approach0 is available from GitHub.
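
To illustrate the general idea of working with paths in an operator tree, the sketch below enumerates one path per leaf, read from the leaf up to the root. The tuple-based tree encoding and the function name are assumptions for illustration; Approach0's actual path representation and matching are described in the ECIR 2019 paper.

```python
def leaf_root_paths(tree, ancestors=()):
    """Yield one path per leaf, listed from the leaf up to the root.

    tree: a (label, children) pair, where children is a list of trees.
    """
    label, children = tree
    if not children:
        yield (label,) + ancestors
        return
    for child in children:
        yield from leaf_root_paths(child, (label,) + ancestors)


# Hypothetical operator tree for a + b * c
expr = ("add", [
    ("a", []),
    ("times", [("b", []), ("c", [])]),
])
paths = list(leaf_root_paths(expr))
```

Two formulas that share a subexpression share the lower portions of the corresponding paths, which is what makes path sets usable as index keys.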

AccessMath: Whiteboard Video Summarization and Math Formula Search

(K. Davila, Nov. 2017) Generating keyframe summaries of lecture videos containing only whiteboard contents. The system works with single-shot recordings of lecture videos. Released as a companion to Kenny's ICDAR 2017 paper on the same work. This work was later used to support keyframe-based video navigation, and cross-modal visual math search (for the Tangent-V (visual) search engine; details: K. Davila's PhD dissertation).

Demos: We have one demo for navigating lecture videos using keyframe summaries, and a more detailed demonstration of the Tangent-V visual formula search engine being applied to keyframe summaries. This second demo includes:
- Examples of keyframe video summaries
- Video navigation tool that allows traversal by clicking on 'ink' in keyframes
- Two binary image versions of the video, one with the speaker, and one with the speaker removed. This allows only the whiteboard contents to be viewed throughout the video, for example.
- Visual, cross-modal formula search. Tangent-V can search formulas within video summary keyframes and lecture notes (in LaTeX), as well as between rendered LaTeX and handwritten formula images in generated keyframes.
Source code (analysis + video frame ground truth creation tools): AccessMath_ICDAR_2017_code.zip
Video annotations/data (256MB): AccessMath_ICDAR_2017_data.zip
Original videos: Video Recordings
Extensions (Fall 2018): A newer version of the source code may be found on github. Re-encoded versions of the lecture videos are now available in MP4 format from CUBS at the University at Buffalo.

The Tangent Math Search Engine: Tree-Based Retrieval

Tangent-V (ECIR 2019 version). Visual formula search engine for .png (raster) and .pdf (vector) formula images. Results from our related ECIR 2019 paper are included in the package.
- Tangent-V (ECIR 2019 version)

Tangent-V and AccessMath (July 2018, by K. Davila). This is a generalization of the Tangent formula search engine for more general Visual Search based on appearance alone. This version can be used for indexing and retrieval of binary images. It has been described in our ICFHR 2018 paper. Note: this version also includes code for indexing keyframes in lecture videos that are used for search (from the lab's AccessMath project).
- Tangent-V source code
- Results from ICFHR 2018 paper

Tangent-S (July 2017, by K. Davila, R. Zanibbi, A. Kane and F. Wm. Tompa). This version of the Tangent formula search engine supports individual and parallel search of formula appearance and semantics. This version extends Tangent v. 0.3.1 below, and is described in our SIGIR 2017 paper.
- Tangent-S

Tangent 0.3.1 (May 2016, by K. Davila, R. Zanibbi, A. Kane and F. Wm. Tompa). This is the version described in our NTCIR-12 competition paper, with wildcard support for full subexpressions, and better separation of code for scoring metrics and locating subexpressions with the best match.
- Tangent 0.3.1 (NTCIR-12 release)

Tangent 0.3 (July 2015, by R. Zanibbi, K. Davila, A. Kane and F. Wm. Tompa). You can download the source code and sample results (including .html pages with highlighted hits) below. This is the version described in our SIGIR 2016 paper.

Tangent 0.2 (2014). Nidhin Pattaniyil implemented this extension of the Tangent system to support matrices and prefix subscripts and superscripts. This updated Tangent combines math expression retrieval with a Solr/Lucene text retrieval system, supporting mixed math and text queries.
Please Note: the files below are quite large, in part so that others have a better chance to replicate our results at NTCIR-11 (2014; NTCIR-11 paper)
- Tangent 0.2 Math Expression Retrieval (Source Code)
- Tangent 0.2 Solr/Lucene Modification for Text Search (Source Code)
- Results from NTCIR-11 Math Retrieval Task

Tangent 0.1 demo (2013). A math expression search engine created by David Stalnaker. This online demo searches math expressions in an earlier version of English Wikipedia.
- Source code: GitHub Page

min and FFES: Earlier Math Entry and Search Interfaces

$m_{in}$ multimodal math search interface (2011-2015, demo). Supports mouse/touch, keyboard, and (limited) image input. The program runs on tablets, desktops and laptops.
- Interface source code: GitHub Page
- Source code for recognition and other server applications used: git clone http://saskatoon.cs.rit.edu:10001/root/min-server-apps.git
- The handwritten symbol recognizer used by min is available below.
- The image-based symbol recognizer source code is available from GitHub

Freehand Formula Entry System (FFES) and DRACULAE handwritten math parser (1999 - 2007); early pen-based equation editor (last release: Aug. 10, 2007)

PDF Graphics Extraction Pipeline

(Ayush K. Shah, Abhisek Dey, Matt Langsenkamp, Richard Zanibbi, Sept. 2021)

Combines an improved PDF SymbolScraper tool, the ScanSSD-XYc and YOLOv4 visual formula detectors, and the QD-GGA math formula visual parser.
Initial version released for the ICDAR 2021 conference; under active development.
GitLab Repository: Graphics Extraction Pipeline

SymbolScraper: PDF symbol extractor

(Matthew Langsenkamp, Oct. 2021) Symbol Scraper Server - a completely rewritten version of SymbolScraper that provides more functionality and is much faster. The new system includes support for Docker containers (GitLab repository)

(Ritvik Joshi et al., April 2019) SymbolScraper is an Apache PDFBox extension (in Java) that provides exact locations and identities for characters and symbols in PDF files. The system was used in Kenny Davila and Ritvik Joshi's ECIR 2019 paper on visual formula search in images.

(Old) GitHub repository

Visual Graphics Detectors (SSD + YOLO-based)

(Abhisek Dey, Sept. 2021) ScanSSD-XYc (Scanning Single-Shot Detector with XY cut-based pooling). Detects formulas in document images. A cleaned up and accelerated version of the original ScanSSD first created by Parag Mali. GitLab page

(JP Ramissini, May 2022) Scaled YOLO v4 formula detector, for math formulas and chemical diagrams. Faster and more accurate than ScanSSD for chemical diagrams. GitLab page

(OLD) (Parag Mali, Puneeth Kukkadapu, and Mahshad Mahdavi, Aug. 2019) Sliding window-based SSD detector (github repository) for locating formulas in document images. Data used to train the system is available from the CROHME + TFD 2019 competition web site. Details may be found in Parag's MSc thesis.

LPGA and QD-GGA: Formula Parsing

(Mahshad Mahdavi, Ayush Kumar Shah, Mar. 2022) QD-GGA: Query-Driven Global Graph Attention visual parser. Adapts the LPGA model (below) for use with a single multi-task convolutional neural network. CVPR 2020 workshop paper -- GitLab page

(Mahshad Mahdavi, Michael Condon, and Kenny Davila, Sept. 2019) LPGA: Line-of-Sight Parsing with Graph-Based Attention. Graph-based model for formula recognition from online handwritten strokes or connected components in (typeset) formula images. The data set used for our CROHME 2019 paper is available online here: InftyMCCDB-2.

CROHME Handwritten Math Recognition Competitions (web page)

IAPR TC11 CROHME Web Page (datasets and evaluation tools)
CROHME InkML file viewer (web page retired; source code provided with CROHMELib below)
Handwritten math symbol recognizers (source code)

Kenny Davila's System (SVM/Random Forest-based using offline-style features, 2014)
Lei Hu's System (HMM-based using online features, 2011) - was used for $m_{in}$

Complete systems submitted by the dprl (the 'RIT' team) for the competition:
Early tools (2011) developed during the lab's participation in the first CROHME (R. Pospesel and K. Hart)

DPRL Natural Scene Text Detector

(S. Zhu, Apr. 2016) The code below was used to produce the results published in Siyu Zhu's 2016 CVPR paper, A Text Detection System for Natural Scenes with Convolutional Feature Learning and Cascaded Classification, which obtained state-of-the-art results on the ICDAR 2015 Focused Scene Text Detection task at the time of publication.

Static git repository (.zip archive)

US Patent Office (USPTO) Figure and Part Label Detection Competition

A paper describing an online competition for labeling parts in US patent diagram images has been posted on arXiv. It was co-authored by Chris Riedl (Northeastern, former Harvard post-doc), Marti Hearst (UC Berkeley), Siyu Zhu, Richard Zanibbi, and researchers from the Harvard-NASA Tournament Lab (Karim Lakhani et al.).
The data and source code for the top-5 placing systems in the competition are available through the UCI Machine Learning Repository.

Evaluation Tools

LgEval (updated, Spring 2024): the Label Graph Evaluation library (by R. Zanibbi, H. Mouchere, and Ayush K. Shah). The library uses labeled directed graphs to represent results for structural pattern recognition tasks. GitLab page
CROHMELib (updated May 2020) : translation and file viewing utilities for CROHME InkML/MathML files (by R. Zanibbi and H. Mouchere). GitLab page
Earlier overview for CROHMELib and LgEval is available (from CROHME 2013/2014 version of the tools)
Recognition Strategy Language (version 2.0, implemented in Standard ML; Programmer's Guide to RSL). Ben Holm wrote this code along with a re-implementation of an American Sign Language video interpretation program using OpenCV for his MSc thesis in 2011 (with contributions to the RSL compiler by Matthew Fluet and Richard Zanibbi), and Chris Sasarak made modifications and extensions in 2012-2013. The source is currently unavailable online; contact rxzvcs@rit.edu if you have questions.
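
Returning to the labeled directed graphs that LgEval uses to represent recognition results: the sketch below shows one way such graphs could be compared, by counting node and edge labels that disagree. The dictionary layout, the labels, and the function name are assumptions for illustration and are not the LgEval file format, API, or metrics.

```python
def label_graph_errors(truth_nodes, truth_edges, out_nodes, out_edges):
    """Count node and edge label disagreements between two label graphs.

    Nodes map primitive id -> label; edges map (parent id, child id) -> label.
    An edge missing from one graph is treated as having the label None.
    """
    node_errors = sum(
        1 for n, label in truth_nodes.items() if out_nodes.get(n) != label
    )
    edge_keys = set(truth_edges) | set(out_edges)
    edge_errors = sum(
        1 for e in edge_keys if truth_edges.get(e) != out_edges.get(e)
    )
    return node_errors, edge_errors


# Hypothetical ground truth for x^2 drawn with two strokes, s1 and s2
truth_nodes = {"s1": "x", "s2": "2"}
truth_edges = {("s1", "s2"): "Sup"}

# Hypothetical recognizer output: wrong symbol and wrong relationship for s2
out_nodes = {"s1": "x", "s2": "z"}
out_edges = {("s1", "s2"): "Right"}

errors = label_graph_errors(truth_nodes, truth_edges, out_nodes, out_edges)
```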


Chemical Informatics

ChemScraper pipeline

Programming Tools

msg_debug: Message-Oriented Debugging Library for Python

Benchmarking Data Sets

ARQMath: Answer Retrieval for Questions on Math

NTCIR-12 MathIR

CROHME Handwritten Math Formula Recognition Dataset

Math-Aware Search

MathDeck: Math-Aware Search Interface

PHOC: Spatial Retrieval (Pyramidal Histograms of Characters)

Tangent-CFT and MathFIRE: Embedding-Based Retrieval

Approach0 Formula Search Engine: Tree-Based Retrieval

AccessMath: Whiteboard Video Summarization and Math Formula Search

The Tangent Math Search Engine: Tree-Based Retrieval

min and FFES: Earlier Math Entry and Search Interfaces

Graphics Recognition & Extraction

PDF Graphics Extraction Pipeline

SymbolScraper: PDF symbol extractor

Visual Graphics Detectors (SSD + YOLO-based)

LPGA and QD-GGA: Formula Parsing

CROHME Handwritten Math Recognition Competitions (web page)

Text Detection & Evaluation Tools

DPRL Natural Scene Text Detector

US Patent Office (USPTO) Figure and Part Label Detection Competition

Evaluation Tools

