Predict the case status of Labor Condition Application (LCA) for H-1B visa program using machine learning and data mining techniques.
CS-513-B Knowledge Discovery and Data Mining Final Project Labor Condition Application (LCA) approval prediction
The Team: Viveksinh Solanki, Gaurang Patel, Ronald Fernandes, Saumya Shastri
Introduction and Motivation • The Labor Condition Application (LCA) is a form that requires U.S. Department of Labor (DOL) approval before an employer can file an H-1B petition to hire non-immigrant workers under the H-1B visa program • H-1B is a U.S. visa that allows employers to temporarily employ foreign workers in various occupations. Without an approved LCA, an employer cannot file an H-1B petition • The number of H-1B visas issued each year is limited, so it is extremely important for an employer to know the factors that affect LCA approval and, eventually, H-1B approval
Objective • Our main objective for this project is to predict the case status of an LCA submitted by an employer to hire non-immigrant workers under the H-1B visa program • Employers can hire non-immigrant workers only after their LCA is approved. The certified LCA is then submitted as part of the Petition for a Nonimmigrant Worker for H-1B work authorization • We uncover insights that can help employers understand the process of getting their LCA approved, using machine learning and data mining techniques to understand the relationship between the features and the target variable (case status)
Data Set • Source: Kaggle • The H-1B dataset selected for this project contains data from employers' Labor Condition Applications and the case certification determinations processed by the Office of Foreign Labor Certification (OFLC). The determinations were issued between October 1, 2016 and June 30, 2017 • The dataset contains: • 40 attributes • 528,147 samples • 4 different classes/labels: WITHDRAWN, CERTIFIED-WITHDRAWN, CERTIFIED and DENIED
Contd.. [Chart] Median wage for denied and certified cases
Contd.. [Chart] Number of applications (legend: red < 1K, orange 1–3K, green otherwise)
Contd.. [Chart] Highest-paying employers (legend: red < $75K, orange $75–100K, green otherwise)
Data Cleaning and Preprocessing • Extracted samples having two labels: CERTIFIED and DENIED • # of samples for CERTIFIED: 468,969 • # of samples for DENIED: 6,983
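The label filtering above leaves a heavy class imbalance (468,969 certified vs. 6,983 denied). A minimal sketch of downsampling to balance the classes is shown below; the column name `CASE_STATUS` and the toy data are assumptions, not taken from the actual dataset:

```python
import pandas as pd

def balance_binary(df, label_col="CASE_STATUS", keep=("CERTIFIED", "DENIED"), seed=42):
    """Keep only the two labels of interest, then randomly downsample each
    class to the minority-class count (column name is an assumption)."""
    df = df[df[label_col].isin(keep)]
    n = int(df[label_col].value_counts().min())
    parts = [df[df[label_col] == k].sample(n, random_state=seed) for k in keep]
    return pd.concat(parts)

# Toy frame standing in for the raw data: 10 CERTIFIED, 3 DENIED, 2 WITHDRAWN rows.
demo = pd.DataFrame({"CASE_STATUS": ["CERTIFIED"] * 10 + ["DENIED"] * 3 + ["WITHDRAWN"] * 2})
balanced = balance_binary(demo)
print(balanced["CASE_STATUS"].value_counts().to_dict())  # {'CERTIFIED': 3, 'DENIED': 3}
```

Downsampling the majority class this way matches the new counts reported on the next slide (6,983 per class).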
Contd.. • Randomly subsampled data from both labels • New counts: DENIED = 6,983 and CERTIFIED = 6,983 • The new dataset contained 4 different visa classes • Kept only samples with visa class = 'H1B' • Wage values were reported at different pay frequencies: "Yearly", "Monthly", "Bi-Weekly", "Weekly", "Hourly" • Converted all values to a yearly basis
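The pay-frequency conversion described above can be sketched as a lookup of annualization multipliers. The column names, unit strings, and the 2,080-hour (40 h/week × 52 weeks) work-year are assumptions for illustration:

```python
import pandas as pd

# Assumed annualization multipliers (2080 = 40 hours/week * 52 weeks).
MULTIPLIER = {"Yearly": 1, "Monthly": 12, "Bi-Weekly": 26, "Weekly": 52, "Hourly": 2080}

def to_yearly(df, wage_col="WAGE_RATE_OF_PAY_FROM", unit_col="WAGE_UNIT_OF_PAY"):
    """Convert every wage figure to a yearly basis and drop the unit column
    (column names are assumptions, not taken from the dataset schema)."""
    df = df.copy()
    df[wage_col] = df[wage_col] * df[unit_col].map(MULTIPLIER)
    return df.drop(columns=[unit_col])

demo = pd.DataFrame({
    "WAGE_RATE_OF_PAY_FROM": [100000.0, 50.0, 8000.0],
    "WAGE_UNIT_OF_PAY": ["Yearly", "Hourly", "Monthly"],
})
print(to_yearly(demo)["WAGE_RATE_OF_PAY_FROM"].tolist())  # [100000.0, 104000.0, 96000.0]
```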
Contd.. • Dropped columns that were irrelevant or had many missing values • e.g. "CASE_SUBMITTED_MONTH", "NAICS_CODE", "PW_SOURCE", "WAGE_RATE_OF_PAY_TO", etc. • Removed samples with missing values (NaN) • New dataset: • # of samples with CERTIFIED: 6,843 • # of samples with DENIED: 6,322
Contd.. • Added a new feature: the difference between the wage rate of pay and the prevailing wage • Converted features with two categories into 1 and 0 • To remove outliers from the wages column, kept only samples with wages < $150,000 • Transformed the remaining categorical values into one-hot encoded values • Applied min-max normalization so all values lie in a single range
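The binary 1/0 conversion, one-hot encoding, and min-max normalization steps above can be sketched with pandas and scikit-learn. The column names and values here are invented stand-ins for the cleaned LCA data:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame standing in for the cleaned data (column names are illustrative).
df = pd.DataFrame({
    "FULL_TIME_POSITION": ["Y", "N", "Y", "Y"],
    "SOC_NAME": ["ANALYSTS", "ENGINEERS", "ANALYSTS", "DEVELOPERS"],
    "WAGE_DIFF": [5000.0, -2000.0, 12000.0, 0.0],  # derived wage-difference feature
})

# Two-category feature -> 1/0
df["FULL_TIME_POSITION"] = (df["FULL_TIME_POSITION"] == "Y").astype(int)

# Multi-category feature -> one-hot indicator columns
df = pd.get_dummies(df, columns=["SOC_NAME"])

# Min-max normalization so every feature lies in [0, 1]
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
print(scaled["WAGE_DIFF"].tolist())
```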
Feature Extraction (Random Forest) • Split preprocessed dataset into 70% train and 30% test • Used random forest to extract top 200 features • Final dataset size: • X_train: (8736, 200) • X_test: (3744, 200)
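Selecting the top features by random-forest importance, as described above, can be sketched as follows. Synthetic data stands in for the one-hot-encoded LCA matrix, and the top-20 cutoff (rather than the project's 200) is scaled down to fit the toy data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed feature matrix.
X, y = make_classification(n_samples=600, n_features=50, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit a forest, rank features by importance, keep the top k.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
top_k = np.argsort(rf.feature_importances_)[::-1][:20]
X_train_sel, X_test_sel = X_train[:, top_k], X_test[:, top_k]
print(X_train_sel.shape, X_test_sel.shape)  # (420, 20) (180, 20)
```

Fitting the selector only on the training split, as here, avoids leaking test information into the feature ranking.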
Model - Random Forest • Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set • Accuracy: 79.01% • AUC: 85.93%
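A minimal sketch of training a random forest and computing both reported metrics (accuracy and AUC) on a held-out 30% split; the data is synthetic, not the LCA dataset, so the printed numbers will differ from the slide's:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)
acc = accuracy_score(y_te, rf.predict(X_te))
auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])  # AUC needs scores, not labels
print(f"accuracy={acc:.3f}  AUC={auc:.3f}")
```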
Model - Decision Tree • A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility • It is one way to display an algorithm that only contains conditional control statements • Accuracy: 78.60% • AUC: 83.35%
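A sketch of a single decision tree on synthetic data; capping the depth is one common guard against the overfitting that a lone tree is prone to (the depth value here is illustrative, not the project's setting):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

# max_depth limits tree growth, trading a little train accuracy for generalization.
tree = DecisionTreeClassifier(max_depth=6, random_state=2).fit(X_tr, y_tr)
acc = accuracy_score(y_te, tree.predict(X_te))
print(f"accuracy={acc:.3f}")
```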
Model - KNN • In pattern recognition, the k-nearest neighbors’ algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space • Accuracy: 79.06% • AUC: 84.59%
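Since k-NN is distance-based, features should share one scale; the project's min-max normalization is repeated inside the pipeline here purely to keep this sketch self-contained (k=5 is an assumption, not the project's setting):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=3)

# Scale inside the pipeline so the scaler is fit on training data only.
knn = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_tr, y_tr)
acc = accuracy_score(y_te, knn.predict(X_te))
print(f"accuracy={acc:.3f}")
```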
Model - Multinomial Naive Bayes • This algorithm estimates the conditional probability of a word/term/token given a class as the relative frequency of term t in documents belonging to class c: • Thus, this variation considers the number of occurrences of term t in training documents from class c, including multiple occurrences. • Accuracy: 76.91% • AUC: 83.50%
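The relative-frequency estimate above can be checked directly against scikit-learn's `MultinomialNB`, which applies Laplace smoothing by default (alpha=1). The tiny count matrix is invented for illustration:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Tiny term-count matrix (rows: documents, columns: terms) with labels 0/1.
X = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
y = np.array([0, 0, 1, 1])

# P(t|c) is estimated as the Laplace-smoothed relative frequency of term t
# among all term occurrences in class c: (T_ct + 1) / (sum_t' T_ct' + |V|).
nb = MultinomialNB(alpha=1.0).fit(X, y)

# Manual check for class 0, term 0: class-0 counts are [5, 0, 1], total 6,
# so the smoothed estimate is (5 + 1) / (6 + 3) = 6/9.
print(np.exp(nb.feature_log_prob_[0, 0]))  # 0.666...
```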
Model - SVM • A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples • An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible • Accuracy: 79.52%
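A sketch of an SVM classifier on synthetic data; note that `SVC` exposes no probabilities by default, and its `decision_function` margin scores can be fed to `roc_auc_score` instead (the kernel and C value here are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=20, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=4)

svm = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
acc = accuracy_score(y_te, svm.predict(X_te))
# roc_auc_score accepts the signed distance to the hyperplane as a score.
auc = roc_auc_score(y_te, svm.decision_function(X_te))
print(f"accuracy={acc:.3f}  AUC={auc:.3f}")
```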
Model - Neural Networks • A Neural Network is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain • Each connection, like the synapses in a biological brain, can transmit a signal from one artificial neuron to another • Accuracy: 81.10% • AUC: 87.57%
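A feed-forward network of the kind described above can be sketched with scikit-learn's `MLPClassifier`; the layer sizes and iteration cap are illustrative choices, not the project's architecture:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=800, n_features=20, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=5)

# Two hidden layers of artificial neurons; weights on each connection are
# learned by backpropagation during fit().
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=5)
mlp.fit(X_tr, y_tr)
acc = accuracy_score(y_te, mlp.predict(X_te))
auc = roc_auc_score(y_te, mlp.predict_proba(X_te)[:, 1])
print(f"accuracy={acc:.3f}  AUC={auc:.3f}")
```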
Findings • Providing only the top features to the algorithms did not affect the accuracy • The derived feature (the difference between the wage rate of pay and the prevailing wage) proved to be the most important feature • The neural network gave the best results, with an accuracy of 81.10% and an AUC of 87.57%
Future Work • As we face the curse of dimensionality due to the many unique employers, we can try two things: • Label encoding the feature • Multiple Correspondence Analysis (MCA)
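The label-encoding idea above replaces thousands of one-hot employer columns with a single integer column; a minimal sketch with invented employer names:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# One-hot encoding many unique employers explodes dimensionality; label
# encoding keeps one integer column instead (employer names are invented).
employers = pd.Series(["ACME CORP", "GLOBEX", "ACME CORP", "INITECH", "GLOBEX"])
codes = LabelEncoder().fit_transform(employers)
print(list(codes))  # [0, 1, 0, 2, 1]
```

One caveat worth noting: label encoding imposes an arbitrary ordering on the employers, which tree-based models tolerate well but distance- or margin-based models (k-NN, SVM) may misinterpret.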