Semantic Classification using Regular Expressions

Abstract

Semantic types provide useful information about what data means and thus help label it. Knowing a column's semantic type also helps address the problem of dirty data when working with Big Data, which is where a semantic classifier is useful. This project creates features for such a classifier using regular expressions. Each regular expression is matched against the values of a data column to calculate the fraction of values it matches. The fraction for each regular expression serves as an individual feature. These features are then used to train the classifier and predict the semantic types of data columns. The classifier is then compared with other semantic classifiers that generate features using different approaches.
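
The feature construction described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the three patterns, the names PATTERNS and match_fractions, and the choice to require a full match of each value are all assumptions made for the example, since the abstract does not list the actual regular expressions or matching rules.

```python
import re

# Hypothetical patterns, chosen only for illustration; the project's actual
# regular expressions are not given in the abstract.
PATTERNS = {
    "zip_code": re.compile(r"\d{5}(-\d{4})?"),
    "phone": re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def match_fractions(column):
    """Return one feature per pattern: the fraction of values it fully matches."""
    values = [str(v).strip() for v in column]
    if not values:
        # An empty column matches nothing; avoid dividing by zero.
        return [0.0] * len(PATTERNS)
    return [
        sum(1 for v in values if pattern.fullmatch(v)) / len(values)
        for pattern in PATTERNS.values()
    ]


# Three of the four values look like ZIP codes; none look like phones or emails.
features = match_fractions(["10001", "11201-1234", "n/a", "07030"])
print(features)
```

Each column yields one such vector, which would form one training row for the classifier, with the column's known semantic type as its label.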

Type
Ashesh Sheth
MSc Capstone Student
Software Development Engineer at AWS