Data Unity Lab

The Data Unity Lab at the Rochester Institute of Technology seeks to develop algorithms, techniques, and tools to improve understanding and simplify analysis of semi-structured data.

Michael Mior

Assistant Professor

Rochester Institute of Technology

Lab Director

Michael Mior is an Assistant Professor in the Data Science cluster at the Rochester Institute of Technology. His research focuses on data integration and understanding for non-relational data. The primary goal of his research is to develop tools and techniques to make diverse data sources easier to analyze.

Research interests

NoSQL Databases
Data Integration
Open Data
Semi-structured Data
Semantic Type Analysis

Education

PhD in Computer Science, 2018
University of Waterloo
MSc in Computer Science, 2011
University of Toronto
BSc in Computing Science, 2009
University of Ontario Institute of Technology

Projects

JSON Schema Inference

Semi-structured data in formats such as JSON often lacks explicit schema information describing the structure and types of the data. This leaves consumers of the data to rely on manual inspection when writing data processing code. Our goal is to perform data-driven analysis to recover a schema that helps developers and data analysts quickly understand and process semi-structured data.

Michael Mior, Justin Namba
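
As a rough illustration of what data-driven schema recovery can look like, the sketch below infers a small JSON Schema-style description from example documents and merges the per-document results. It is a minimal sketch under our own assumptions, not the lab's actual inference method; the function names (infer_schema, merge, infer_from_documents) are hypothetical.

```python
from functools import reduce


def infer_schema(value):
    """Describe a single JSON value with a small JSON Schema-like dict."""
    if isinstance(value, bool):  # bool must be tested before int in Python
        return {"type": "boolean"}
    if value is None:
        return {"type": "null"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    if isinstance(value, str):
        return {"type": "string"}
    if isinstance(value, list):
        schema = {"type": "array"}
        if value:
            schema["items"] = reduce(merge, (infer_schema(v) for v in value))
        return schema
    if isinstance(value, dict):
        return {
            "type": "object",
            "properties": {k: infer_schema(v) for k, v in value.items()},
            "required": sorted(value),
        }
    raise TypeError(f"not a JSON value: {value!r}")


def merge(a, b):
    """Combine two inferred schemas into one that accepts both inputs."""
    if a == b:
        return a
    if a.get("type") == "object" and b.get("type") == "object":
        props = dict(a["properties"])
        for key, sub in b["properties"].items():
            props[key] = merge(props[key], sub) if key in props else sub
        # A property is only required if every observed object had it.
        required = sorted(set(a["required"]) & set(b["required"]))
        return {"type": "object", "properties": props, "required": required}
    if a.get("type") == "array" and b.get("type") == "array":
        if "items" in a and "items" in b:
            return {"type": "array", "items": merge(a["items"], b["items"])}
        return a if "items" in a else b
    # Otherwise the types disagree: record the alternatives without duplicates.
    alternatives = []
    for schema in (a, b):
        for alt in schema.get("anyOf", [schema]):
            if alt not in alternatives:
                alternatives.append(alt)
    return {"anyOf": alternatives}


def infer_from_documents(documents):
    """Infer one schema for a non-empty collection of parsed JSON documents."""
    return reduce(merge, (infer_schema(doc) for doc in documents))
```

Calling infer_from_documents on two objects where only one contains a given key yields an object schema in which that key is listed under properties but not under required.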

Semantic Type Analysis

The goal of this project is to assign semantically meaningful types to collections of data values. While types such as strings provide some information, a more useful type for analysts provides additional semantic information. For example, data values such as Kampala, Lima, and Shanghai may be labelled with the semantic type city. This is not a completely new problem, but we aim to consider how semantic type analysis relates to semi-structured data in order to exploit the relationships between nested data values.

Michael Mior, Shuang Wei

Relational Playground

A learning environment for exploring the connection between relational algebra and SQL.

Michael Mior, Aryan Jha, Carson Bloomingdale, Sushruth Beeti

NoSE

Automated schema design for NoSQL databases.

Michael Mior, Yusuke Wakuta

Featured Publications

Pablo Suárez-Otero, Michael Mior, María José Suárez Cabal, Javier Tuya

November, 2025 Information and Software Technology

Data migration for column family database evolution

Context: Database evolution involves processes such as the evolution of the schema, the adaptation of the application to the new schema, and migrations of data to the new or modified structures of the schema. Data migration is particularly crucial in databases where data repetition is common such as the NoSQL column family DBMSs. In these systems, data integrity cannot be enforced from the database side, but instead needs to be maintained from the application side. Database evolution is also affected by data repetition and the absence of data integrity enforcement from the database, as any evolution of the schema requires data migrations to maintain data integrity. Objectives: Ensure data integrity in NoSQL column family DBMSs during database evolution by providing specific instructions for the execution of the necessary data migrations. Methods: We propose MoDEvo, a model-driven engineering approach that provides a data migration model to ensure data integrity for database evolution in column-family DBMSs. This model is then transformed into an executable script that implements the migration procedures. Results: We evaluate MoDEvo by executing data migrations in case studies obtained from open-source projects where the schema evolved. In this evaluation we use Apache Cassandra, the most popular column-family DBMS. Through this evaluation, we verify that the scripts generated from the data migration model effectively maintain data integrity within the database. Conclusion: MoDEvo aids database evolution in column family DBMSs by avoiding the creation of inconsistencies and can also detect impossible migrations, thereby preventing errors. There is still room for improvement such as extending the supported databases to other paradigms where data repetition is common and addressing the evolution of the client applications alongside schema evolution.

Bhavin Oza

May, 2024

Natural Language Query to MongoDB Query

Natural language interfaces are an evolving research area aimed at learning and contextualizing natural language processing for human-computer interaction systems. With advances in natural language processing (NLP) and machine learning, many significant systems have been built to understand and process human language and produce output in the form of code or database queries. Several of the most remarkable of these systems are based on the Transformer and its attention mechanism. Our project works on one such system, which converts natural language queries to MongoDB queries.

Koteswara Rao Bade

August, 2023

Minimization Of Large JSON Input For Efficient Debugging

In this project, we present a novel approach to simplify the debugging process for developers working with large JSON lines data. Our solution involves the creation of a program that iteratively reduces the size of the JSON lines file by removing the JSON objects which are not responsible for the error, providing developers with a more manageable subset of the data. By progressively minimizing the input, we aim to improve the efficiency and effectiveness of debugging procedures significantly, ultimately streamlining the development workflow.
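
The iterative reduction described above can be sketched as a simplified, delta-debugging-style loop over the lines of a JSON Lines file. This is an illustrative sketch under our own assumptions rather than the project's implementation: minimize_lines and the example predicate fails_on_string_age are hypothetical names, and the result is smaller than the input but not guaranteed to be the smallest possible failing input.

```python
import json


def minimize_lines(lines, fails):
    """Shrink a list of JSON Lines records while fails(lines) stays True.

    `fails` is a caller-supplied predicate that re-runs the buggy program
    (or a stand-in for it) and reports whether the error still occurs.
    """
    if not fails(lines):
        raise ValueError("the original input does not trigger the error")
    chunk = len(lines) // 2 or 1
    while chunk >= 1:
        i = 0
        while i < len(lines):
            candidate = lines[:i] + lines[i + chunk:]
            if fails(candidate):
                lines = candidate  # the removed chunk was not needed
            else:
                i += chunk         # the chunk is needed; keep it and move on
        chunk //= 2                # retry with smaller chunks
    return lines


def fails_on_string_age(lines):
    """Hypothetical stand-in for a program that crashes on a string 'age'."""
    try:
        for line in lines:
            json.loads(line)["age"] + 1
    except TypeError:
        return True
    return False
```

Given a file where a single record has a string in the age field, minimize_lines(lines, fails_on_string_age) narrows the input down to that one record.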

Shuang Wei, Michael Mior

July, 2023

Comprehending Semantic Types in JSON Data with Graph Neural Networks

Graph neural networks for semantic type detection in JSON.

Michael Mior

June, 2023 DataEd ’23

Relational Playground: Teaching the Duality of Relational Algebra and SQL

Relational Playground is a tool for students to explore the connection between relational algebra and SQL.

Michael Mior

June, 2023 SiMoD ’23

Learning from Uncurated Regular Expressions

Semantic type classification using a large regular expression corpus.

Pablo Suárez-Otero, Michael Mior, María José Suárez Cabal, Javier Tuya

May, 2023 Journal of Systems and Software

CoDEvo: Column family database evolution using model transformations

In recent years, software applications have been working with NoSQL databases as they have emerged to handle big data more efficiently than traditional databases. The data models of these databases are designed to satisfy the requirements of the software application, which means that the models must evolve when the requirements of the software application change. To avoid mistakes during the design and evolution of these NoSQL models, there are several methodologies that recommend using a conceptual model. This implies that consistency between the conceptual model and the schema must be maintained when either evolving the database or the software application. In this work, we propose CoDEvo, a model-driven engineering approach that uses model transformations to address the evolution of a NoSQL column family DBMS schema when the underlying conceptual model evolves due to software requirement changes, aiming to maintain consistency between the schema and conceptual model. We have addressed this problem by defining transformation rules that determine how to evolve the schema for a specific conceptual model change. To validate these transformations, we applied them to conceptual model changes from 9 open-source software applications, comparing the output schemas from CoDEvo with the schemas that were defined in these applications.

Ashesh Sheth

May, 2022

Semantic Classification using Regular Expressions

Semantic types of data provide useful information about what the data means and thus help label the data. This also helps address the problem of dirty data when working with Big Data, a scenario in which a semantic classifier can be useful. This project creates features for a classifier using regular expressions. The regular expressions are matched against each data column to calculate the fraction of values they match. The fraction for each regular expression serves as an individual feature. These features are then used to train the classifier and predict the semantic types of the data columns. This classifier is then compared to other semantic classifiers that generate features using different approaches.
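
The feature construction described above can be sketched as follows: each regular expression contributes one feature, the fraction of values in a column that it matches. This is a minimal sketch, assuming whole-value matching and string-valued columns; the function name regex_features and the example patterns are hypothetical and not taken from the project.

```python
import re


def regex_features(values, patterns):
    """Return one feature per pattern: the fraction of values it fully matches."""
    compiled = [re.compile(p) for p in patterns]
    if not values:
        return [0.0] * len(compiled)
    return [
        sum(1 for v in values if rx.fullmatch(v)) / len(values)
        for rx in compiled
    ]


# Hypothetical patterns: a five-digit code and a capitalized word.
EXAMPLE_PATTERNS = [r"\d{5}", r"[A-Z][a-z]+"]
```

The resulting vector for each column, paired with a known semantic type label, could then be used as a training example for any standard classifier.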

Chirag Goel

May, 2022

Semantic JSON Generation

Semantic types describe the information about the entity types and the data those types hold. Detecting semantic types has been a challenge in recent years, and most machine learning models fail to detect semantic types with great accuracy when used against dirty data. These models were generally trained on relational databases, and the testing results of models trained on JSON datasets are still unknown. I introduce a way of creating JSON data files that can be used for training models that detect semantic types. I used the Sherlock dataset to create JSON data files based on the relationships found amongst the semantic types. The relationships between the semantic types were determined using the ontology mentioned on DBpedia. I was able to find different types of relationships between the semantic types, and based on those relationships I was able to generate Semantic JSON data files. However, I found some anomalies corresponding to some semantic types in the final JSON data files. To evaluate the results, I tracked the anomalies from the Sherlock dataset to the source dataset. The source dataset was corrupted at the time the Sherlock dataset was created.

Goldy Malhotra

January, 2022

Investigating Vector Space Embeddings for Database Schema Management

Text generation in the area of natural language processing as part of the artificial intelligence field has been greatly improving over the last several years. Here we examine the application of vector space word embeddings to provide additional information and context during the text generation process as a way to improve the resultant output through the lens of database normalization. It is known that words encoded into vector space that are closer together in distance generally share meaning or have some semantic or symbolic relationship. This knowledge, paired with the known ability of recurrent neural networks in learning sequences, will be used to examine how vectorizing words can benefit text generation. While the majority of database normalization has been automated, the naming of the generated normalized tables has not. This work seeks to use word embeddings, generated from the data columns of a database table, to give context to a recurrent neural network model while it learns to generate database table names. Using real world data, a recurrent neural network based artificial intelligence model will be paired with a context vector made of word embeddings to observe how effective word embeddings are at providing additional context information during the learning and generation processes. Several methods for generating the context vector will be examined, such as how the word embeddings are generated and how they are combined. The exploration of these methods yielded very promising results in line with the overall goals of the performed work. The benefit of incorporating word embeddings to supply additional information during the text generation process allows for better learning with the goal of generating more human-useful names for newly normalized database tables from their data column titles.

Gautam Gadipudi

December, 2021

Detecting Silent JSON Changes in Dynamic Programming Languages

In this capstone project, we implement a static class in Python3 with static methods to capture details about the operations - frame metadata - performed on JSON data, log this frame metadata, and match it against target frame metadata to discover examples and scenarios of silent JSON errors in Python3 programs.

Angela Bonifati, Michael Mior, Felix Naumann, Nele Sina Noack

December, 2021 SIGMOD Record

How Inclusive are We?

Analysis of gender diversity in database publications.

Michael Mior

November, 2021 Computing Research Repository

Fast Discovery of Nested Dependencies on JSON Data

Discovery of nested data dependencies on semi-structured data sources.

Pablo Suárez-Otero, Michael Mior, María José Suárez Cabal, Javier Tuya

October, 2021 Advances in Conceptual Modeling - ER 2021 Workshops CoMoNoS, EmpER, CMLS, St. John’s, NL, Canada, October 18-21, 2021, Proceedings

An Integrated Approach for Column-Oriented Database Application Evolution Using Conceptual Models

Ammar Alsulami

August, 2021

Empirical Analysis of JSON Schema Use

Motivated by the wide adoption of JSON Schema, this paper investigates the use and characteristics of this technology. We collected, prepared, and analyzed 47,610 JSON files to draw meaningful conclusions for schema developers. Even with the refinement of later schema versions, version four is the most commonly used, and string types outnumber all other types. The majority of errors while validating schemas are due to using new features while referring to an older definition of the schema.

Justin Namba

August, 2021 VLDB PhD Workshop

Enhancing JSON Schema Discovery by Uncovering Hidden Data

Improving JSON schema discovery by disambiguating metadata.

Pablo Suárez-Otero, Michael Mior, María José Suárez Cabal, Javier Tuya

December, 2020 MIDP

Maintaining NoSQL Database Quality During Conceptual Model Evolution

Sagar Khanna

August, 2020

Column prediction using Recurrent Neural Networks

The recent text generation model using character embeddings is an efficient method for learning high-quality distributed vector representation that captures many precise syntactic and semantic character relationships. In this paper, I present an extension that can be applied to a distributed representation of a database column. Using known column names of a table, we train our model to generate new and meaningful column names.

Michael Mior, Ken Q. Pu

August, 2020 IRI

Semantic Data Understanding with Character Level Learning

Convolutional neural networks for semantic data understanding.

Shashank Prabhakar

May, 2019

Semantic JSON Generation

What started as a contribution towards enhancing the performance of the query optimizer in Apache Calcite, which is an actively growing open-source framework for building and managing databases, transitioned into optimization of SQL queries in general. To build an intricate analytic system for any emerging real-world big data application, complex queries are needed. These applications demand very high performance and ultra-practical functionality, which is to provide a high-level analysis of the system. These complex queries will have a lot of reusable conditional sub-expressions. The idea of reusability can be extended even to big data systems where querying happens in batches and the data being dealt with will be in terabytes and will be continuously growing. These systems are expected to deliver high performance by processing the queries and obtaining results quickly. The main idea behind optimizing these queries is how these conditional sub-expressions can be scrutinized and capitalized upon, which will result in efficient big data systems with reusability. In this paper, the idea of reusable conditional sub-expressions is achieved by building directed acyclic graphs for every sub-expression of the query and inter-linking them accordingly. An optimizer which takes in multiple SQL statements as input will provide an efficient way to enhance the performance of the system into which the SQL queries are plugged.

Michael Mior, Kenneth Salem

October, 2018 ER

Renormalization of NoSQL Database Schemas

Conceptual normalization for NoSQL schemas.

Edmon Begoli, Jesús Camacho-Rodriguez, Julian Hyde, Michael Mior, Daniel Lemire

July, 2017 SIGMOD

Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

Framework for heterogeneous query processing and optimization.

Michael Mior, Kenneth Salem, Ashraf Aboulnaga, Rui Liu

July, 2017 TKDE

NoSE: Schema Design for NoSQL Applications

Automated schema design for NoSQL databases.

Michael Mior, Kenneth Salem, Ashraf Aboulnaga, Rui Liu

May, 2016 ICDE

NoSE: Schema design for NoSQL applications

Automated schema design for NoSQL databases.

Recent Publications

Pablo Suárez-Otero, Michael Mior, María José Suárez Cabal, Javier Tuya (2025). Data migration for column family database evolution. Information and Software Technology.

Bhavin Oza (2024). Natural Language Query to MongoDB Query.

Koteswara Rao Bade (2023). Minimization Of Large JSON Input For Efficient Debugging.

Shuang Wei, Michael Mior (2023). Comprehending Semantic Types in JSON Data with Graph Neural Networks.

Michael Mior (2023). Relational Playground: Teaching the Duality of Relational Algebra and SQL. DataEd ’23.

Michael Mior (2023). Learning from Uncurated Regular Expressions. SiMoD ’23.

Pablo Suárez-Otero, Michael Mior, María José Suárez Cabal, Javier Tuya (2023). CoDEvo: Column family database evolution using model transformations. Journal of Systems and Software.

Ashesh Sheth (2022). Semantic Classification using Regular Expressions.

Chirag Goel (2022). Semantic JSON Generation.

Goldy Malhotra (2022). Investigating Vector Space Embeddings for Database Schema Management.

Gautam Gadipudi (2021). Detecting Silent JSON Changes in Dynamic Programming Languages.

Angela Bonifati, Michael Mior, Felix Naumann, Nele Sina Noack (2021). How Inclusive are We? SIGMOD Record.

Michael Mior (2021). Fast Discovery of Nested Dependencies on JSON Data. Computing Research Repository.

Pablo Suárez-Otero, Michael Mior, María José Suárez Cabal, Javier Tuya (2021). An Integrated Approach for Column-Oriented Database Application Evolution Using Conceptual Models. Advances in Conceptual Modeling - ER 2021 Workshops CoMoNoS, EmpER, CMLS, St. John’s, NL, Canada, October 18-21, 2021, Proceedings.

Ammar Alsulami (2021). Empirical Analysis of JSON Schema Use.

Justin Namba (2021). Enhancing JSON Schema Discovery by Uncovering Hidden Data. VLDB PhD Workshop.

Pablo Suárez-Otero, Michael Mior, María José Suárez Cabal, Javier Tuya (2020). Maintaining NoSQL Database Quality During Conceptual Model Evolution. MIDP.

Sagar Khanna (2020). Column prediction using Recurrent Neural Networks.

Michael Mior, Ken Q. Pu (2020). Semantic Data Understanding with Character Level Learning. IRI.

Shashank Prabhakar (2019). Semantic JSON Generation.