The Data Unity Lab at the Rochester Institute of Technology seeks to develop algorithms, techniques, and tools to improve understanding and simplify analysis of semi-structured data.
Michael Mior is an Assistant Professor in the Data Science cluster at the Rochester Institute of Technology. His research focuses on data integration and understanding for non-relational data. The primary goal of his research is to develop tools and techniques to make diverse data sources easier to analyze.
PhD in Computer Science, 2018
University of Waterloo
MSc in Computer Science, 2011
University of Toronto
BSc in Computing Science, 2009
University of Ontario Institute of Technology
Semantic type inference
JSON schema discovery
Graph neural networks for semantic type detection in JSON.
Relational Playground is a tool for students to explore the connection between relational algebra and SQL.
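A minimal sketch (not taken from the Relational Playground codebase) of the kind of correspondence the tool lets students explore: selection and projection over rows represented as dictionaries, with the equivalent SQL shown in comments. The relation and attribute names are illustrative.

```python
# Illustrative relation: a list of rows represented as dictionaries.
employees = [
    {"name": "Ada", "dept": "CS", "salary": 120000},
    {"name": "Grace", "dept": "CS", "salary": 130000},
    {"name": "Alan", "dept": "Math", "salary": 110000},
]

def select(relation, predicate):
    """Selection sigma_p(R)  --  SQL: SELECT * FROM R WHERE p"""
    return [row for row in relation if predicate(row)]

def project(relation, attributes):
    """Projection pi_A(R)  --  SQL: SELECT A FROM R"""
    return [{a: row[a] for a in attributes} for row in relation]

# pi_{name}(sigma_{dept='CS'}(employees))
# SQL: SELECT name FROM employees WHERE dept = 'CS'
print(project(select(employees, lambda r: r["dept"] == "CS"), ["name"]))
```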
In recent years, software applications have increasingly adopted NoSQL databases, which emerged to handle big data more efficiently than traditional databases. The data models of these databases are designed to satisfy the requirements of the software application, which means that the models must evolve when the application's requirements change. To avoid mistakes during the design and evolution of these NoSQL models, several methodologies recommend using a conceptual model. This implies that consistency between the conceptual model and the schema must be maintained when either the database or the software application evolves. In this work, we propose CoDEvo, a model-driven engineering approach that uses model transformations to evolve a NoSQL column-family DBMS schema when the underlying conceptual model evolves due to changes in software requirements, keeping the schema and the conceptual model consistent. We address this problem by defining transformation rules that determine how to evolve the schema for a specific conceptual model change. To validate these transformations, we applied them to conceptual model changes from 9 open-source software applications, comparing the output schemas from CoDEvo with the schemas defined in these applications.
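A minimal sketch of the general idea, not CoDEvo itself: a transformation rule maps one conceptual-model change to a corresponding change in a column-family schema. The names `AddAttribute`, `ColumnFamilySchema`, and the table-naming convention are hypothetical.

```python
# Hypothetical sketch of a model-to-schema transformation rule; names and
# structures are illustrative, not taken from CoDEvo.
from dataclasses import dataclass, field

@dataclass
class ColumnFamilySchema:
    # table name -> list of column names
    tables: dict = field(default_factory=dict)

@dataclass
class AddAttribute:
    """Conceptual-model change: an attribute is added to an entity."""
    entity: str
    attribute: str

def apply_change(schema: ColumnFamilySchema, change: AddAttribute) -> ColumnFamilySchema:
    """Transformation rule: adding an attribute to an entity adds a column
    to every table that stores that entity."""
    for table, columns in schema.tables.items():
        if table.startswith(change.entity):
            columns.append(change.attribute)
    return schema

schema = ColumnFamilySchema(tables={"User_by_id": ["id", "name"]})
print(apply_change(schema, AddAttribute(entity="User", attribute="email")).tables)
```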
Semantic types provide useful information about what data means and thus help label the data. They also help address the problem of dirty data when working with big data, and a semantic classifier can be useful in this scenario. This project creates features for such a classifier using regular expressions. Each regular expression is matched against the values in a data column to calculate the fraction of values it matches, and the match fraction for each regular expression serves as an individual feature. These features are then used to train the classifier and predict the semantic types of the data columns. The classifier is then compared to other semantic classifiers that generate features using different approaches.
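A minimal sketch of the feature construction described above. The regular expressions and the example column are illustrative; the real feature set uses patterns chosen for the semantic types of interest and feeds the resulting vector to a classifier.

```python
import re

# Illustrative regular expressions, one per candidate semantic type.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "year": re.compile(r"^(19|20)\d{2}$"),
    "zip_code": re.compile(r"^\d{5}(-\d{4})?$"),
}

def column_features(values):
    """For each regex, the fraction of values in the column it matches.
    These fractions form the feature vector for the classifier."""
    features = {}
    for name, pattern in PATTERNS.items():
        matches = sum(1 for v in values if pattern.match(str(v)))
        features[name] = matches / len(values) if values else 0.0
    return features

print(column_features(["1999", "2004", "2021", "n/a"]))
# {'email': 0.0, 'year': 0.75, 'zip_code': 0.0}
```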
Semantic types describe the entity types in a dataset and the data those types hold. Detecting semantic types has been a challenge in recent years, and most machine learning models fail to detect semantic types accurately when applied to dirty data. These models were generally trained on relational databases, and the testing results of models trained on JSON datasets are still unknown. I introduce a way of creating JSON data files that can be used to train models that detect semantic types. I used the Sherlock dataset to create JSON data files based on the relationships found among the semantic types, where the relationships between semantic types were determined using the DBpedia ontology. I found several kinds of relationships between the semantic types and, based on those relationships, generated semantic JSON data files. However, I found anomalies corresponding to some semantic types in the final JSON data files. To evaluate the results, I traced the anomalies from the Sherlock dataset back to the source dataset, which had been corrupted at the time the Sherlock dataset was created.
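A minimal sketch of how related columns might be grouped into nested JSON records. The relationship table and column values below are illustrative placeholders, not the actual Sherlock data or DBpedia ontology relationships.

```python
import json

# Illustrative relationships between semantic types (the real ones come from
# the DBpedia ontology) and illustrative column values from a tabular source.
RELATED = {"city": ["country", "population"]}
COLUMNS = {
    "city": ["Rochester", "Toronto"],
    "country": ["USA", "Canada"],
    "population": [211000, 2794000],
}

def to_json_records(root_type):
    """Nest columns whose semantic types are related to the root type."""
    records = []
    for i, value in enumerate(COLUMNS[root_type]):
        record = {root_type: value}
        for child in RELATED.get(root_type, []):
            record[child] = COLUMNS[child][i]
        records.append(record)
    return records

print(json.dumps(to_json_records("city"), indent=2))
```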
Text generation, an area of natural language processing within artificial intelligence, has improved greatly over the last several years. Here we examine the application of vector space word embeddings to provide additional information and context during the text generation process, as a way to improve the resulting output, through the lens of database normalization. Words encoded into vector space that are closer together in distance generally share meaning or have some semantic or symbolic relationship. This knowledge, paired with the known ability of recurrent neural networks to learn sequences, is used to examine how vectorizing words can benefit text generation. While the majority of database normalization has been automated, the naming of the generated normalized tables has not. This work uses word embeddings, generated from the data columns of a database table, to give context to a recurrent neural network model while it learns to generate database table names. Using real-world data, a recurrent neural network model is paired with a context vector built from word embeddings to observe how effective the embeddings are at providing additional information during learning and generation. Several methods for constructing the context vector are examined, including how the word embeddings are generated and how they are combined. Exploring these methods yielded promising results in line with the goals of this work: incorporating word embeddings to supply additional information during text generation improves learning and produces more human-useful names for newly normalized database tables from their column titles.
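A minimal sketch of one way to form such a context vector, assuming a tiny illustrative embedding table: average the word embeddings of a table's column titles into a single vector that conditions the generator. The real model uses learned, higher-dimensional embeddings and an RNN decoder.

```python
import numpy as np

# Illustrative 4-dimensional word embeddings; real models use pretrained or
# learned embeddings with hundreds of dimensions.
EMBEDDINGS = {
    "customer": np.array([0.9, 0.1, 0.0, 0.2]),
    "order":    np.array([0.7, 0.3, 0.1, 0.0]),
    "date":     np.array([0.0, 0.8, 0.5, 0.1]),
}

def context_vector(column_titles):
    """Average the embeddings of the column titles to produce a single
    context vector that conditions the RNN while it generates a table name."""
    vectors = [EMBEDDINGS[t] for t in column_titles if t in EMBEDDINGS]
    return np.mean(vectors, axis=0) if vectors else np.zeros(4)

print(context_vector(["customer", "order", "date"]))
```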
In this capstone project, we implement a static class in Python3 with static methods to capture details about the operations performed on JSON data (frame metadata), log this frame metadata, and match it against target frame metadata to discover examples and scenarios of silent JSON errors in Python3 programs.
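A minimal sketch of the kind of static class described above; the method names and the metadata fields recorded here are assumptions, not the project's actual interface.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)

class FrameMetadata:
    """Hypothetical static class: each @staticmethod records metadata about
    an operation performed on JSON data."""
    log = []

    @staticmethod
    def record(operation, data):
        entry = {
            "operation": operation,
            "keys": sorted(data.keys()) if isinstance(data, dict) else None,
            "size": len(data),
        }
        FrameMetadata.log.append(entry)
        logging.info("frame metadata: %s", entry)
        return entry

    @staticmethod
    def matches(target):
        """Check the recorded metadata against a target to flag silent errors."""
        return any(all(entry.get(k) == v for k, v in target.items())
                   for entry in FrameMetadata.log)

doc = json.loads('{"name": "Ada", "id": 1}')
FrameMetadata.record("load", doc)
print(FrameMetadata.matches({"operation": "load", "size": 2}))  # True
```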
Discovery of nested data dependencies on semi-structured data sources.
Motivated by the wide adoption of JSON Schema, this paper investigates the use and characteristics of this technology. We collected, prepared, and analyzed 47,610 JSON Schema files to draw meaningful conclusions for schema developers. Even as newer schema versions are released, version four remains the most commonly used, and string types outnumber all other types. The majority of errors encountered while validating schemas are due to using new features while referring to an older version of the schema specification.
Improving JSON schema discovery by disambiguating metadata.
Recent text generation models using character embeddings are an efficient method for learning high-quality distributed vector representations that capture many precise syntactic and semantic relationships between characters. In this paper, I present an extension that can be applied to a distributed representation of a database column. Using the known column names of a table, we train our model to generate new and meaningful column names.
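A minimal sketch of generating column names from known ones, using a character bigram model as a simple stand-in for the learned character-embedding model described above. The training names are illustrative.

```python
import random
from collections import defaultdict

# Stand-in for the learned character-embedding model: a character bigram
# model trained on known column names. Names below are illustrative.
KNOWN_COLUMNS = ["user_id", "user_name", "order_id", "order_date"]

def train_bigrams(names):
    counts = defaultdict(list)
    for name in names:
        chars = ["^"] + list(name) + ["$"]   # ^ and $ mark start and end
        for a, b in zip(chars, chars[1:]):
            counts[a].append(b)
    return counts

def generate(counts, max_len=12):
    name, prev = "", "^"
    while len(name) < max_len:
        nxt = random.choice(counts[prev])
        if nxt == "$":
            break
        name += nxt
        prev = nxt
    return name

bigrams = train_bigrams(KNOWN_COLUMNS)
print(generate(bigrams))  # e.g. "user_date"
```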
Convolutional neural networks for semantic data understanding.
What started as a contribution towards enhancing the performance of the query optimizer in Apache Calcite, an actively growing open-source framework for building and managing databases, transitioned into optimization of SQL queries in general. Building an intricate analytic system for any emerging real-world big data application requires complex queries. These applications demand very high performance and practical functionality in order to provide a high-level analysis of the system. Such complex queries contain many reusable conditional sub-expressions. The idea of reusability extends even to big data systems where querying happens in batches and the data involved is measured in terabytes and continuously growing; these systems are expected to deliver high performance by processing queries and obtaining results quickly. The main idea behind optimizing these queries is to examine how these conditional sub-expressions can be scrutinized and capitalized upon, resulting in efficient big data systems with reusability. In this paper, reuse of conditional sub-expressions is achieved by building a directed acyclic graph for every sub-expression of a query and linking the graphs accordingly. An optimizer that takes multiple SQL statements as input provides an efficient way to enhance the performance of the system into which the SQL queries are plugged.
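A minimal sketch of the DAG idea under simplified assumptions: identical conditional sub-expressions across queries are interned into a single node so they appear once and can be shared. The expression representation here is illustrative and not Apache Calcite's internal one.

```python
# Illustrative DAG of conditional sub-expressions: identical sub-expressions
# are interned so they appear once and can be shared across queries.
class ExprDAG:
    def __init__(self):
        self.nodes = {}          # expression key -> node id
        self.children = {}       # node id -> child node ids

    def intern(self, op, *operands):
        key = (op,) + operands
        if key not in self.nodes:
            node_id = len(self.nodes)
            self.nodes[key] = node_id
            self.children[node_id] = [o for o in operands if isinstance(o, int)]
        return self.nodes[key]

dag = ExprDAG()
# Two queries sharing the condition "status = 'active'"
active = dag.intern("=", "status", "'active'")
q1 = dag.intern("AND", active, dag.intern(">", "amount", "100"))
q2 = dag.intern("AND", active, dag.intern("<", "age", "30"))
print(len(dag.nodes))  # 5 nodes instead of 6: the shared condition appears once
```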
Framework for heterogeneous query processing and optimization.