Research
My current interest is machine learning optimization. I am working on static warning identification using incremental active learning and human-computer interaction, aiming for higher recall at lower cost by exploring fewer irrelevant static warning samples. I am also working on feature extraction from programming-language and SE artifacts with embedding methods. For more information about my research, please visit my Google Scholar.
|
 |
Learning to recognize actionable static code warnings (is intrinsically easy)
Xueqi Yang, Jianfeng Chen, Rahul Yedida, Zhe Yu, Tim Menzies, Empirical Software Engineering
Static code warning tools often generate warnings that programmers ignore. Such tools can be made more useful via data mining algorithms that select the actionable warnings; i.e. the warnings that are usually not ignored. In this paper, we look for actionable warnings within a sample of 5,675 actionable warnings found among 31,058 static code warnings from FindBugs. We find that data mining algorithms can find actionable warnings with remarkable ease. We implement deep neural networks in Keras and PyTorch on static defect artifacts to predict real defects to act on, use regularizers to keep the DNN models from overfitting and to lower the running overhead, and use box-counting methods to explore the intrinsic dimension of SE data and match the complexity of machine learning algorithms to the datasets they handle.
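To make the box-counting idea concrete, here is a minimal sketch (not the paper's exact procedure) that estimates intrinsic dimension as the slope of log(occupied grid cells) versus log(grid resolution); the scales and the synthetic data are placeholders.

```python
# Minimal box-counting sketch (illustrative only; not the paper's exact procedure).
import numpy as np

def box_counting_dimension(X, scales=(2, 4, 8, 16, 32)):
    span = X.max(axis=0) - X.min(axis=0) + 1e-12
    X = (X - X.min(axis=0)) / span                         # normalize to the unit hypercube
    counts = []
    for boxes_per_axis in scales:
        cells = np.floor(X * boxes_per_axis).astype(int)   # grid cell index of each point
        counts.append(len({tuple(row) for row in cells}))  # number of occupied cells
    slope, _ = np.polyfit(np.log(scales), np.log(counts), 1)
    return slope

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    basis = rng.normal(size=(2, 10))                 # a 2-D plane embedded in 10-D
    X = rng.uniform(size=(2000, 2)) @ basis
    print(round(box_counting_dimension(X), 2))       # expect a value near 2
```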
|
 |
Understanding static code warnings: An incremental AI approach
Xueqi Yang, Zhe Yu, Junjie Wang, Tim Menzies, Expert Systems with Applications
Static code analysis is a widely-used method for detecting bugs and security vulnerabilities in software systems. As software becomes more complex, analysis tools also report lists of increasingly complex warnings that developers need to address on a daily basis. Such static code analysis tools are usually over-cautious; i.e. they often offer many warnings about spurious issues. In this paper, we identify actionable static warnings in nine Java projects, generated by FindBugs, using incremental active learning and machine learning algorithms to achieve higher recall at lower cost by reducing false alarms. We use different sampling approaches (random sampling, uncertainty sampling and certainty sampling) to query the warnings suggested by the active learning algorithm, and the system interacts with a human oracle to update itself.
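The incremental loop can be sketched in a few lines; the classifier, feature matrix, oracle and budget below are placeholder assumptions, not the exact setup from the paper.

```python
# Sketch of an incremental active-learning loop with uncertainty sampling.
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning(X, oracle, seed_idx, budget=100):
    labeled = list(seed_idx)                    # warnings already inspected by the human
    labels = [oracle(i) for i in labeled]       # seed should cover both classes
    pool = [i for i in range(len(X)) if i not in set(labeled)]
    model = LogisticRegression(max_iter=1000)
    for _ in range(budget):
        model.fit(X[labeled], labels)
        proba = model.predict_proba(X[pool])[:, 1]
        # Uncertainty sampling: query the warning the model is least sure about
        # (certainty sampling would query np.argmax(proba) instead).
        query = pool[int(np.argmin(np.abs(proba - 0.5)))]
        labeled.append(query)
        labels.append(oracle(query))            # the human oracle labels the query
        pool.remove(query)
    return model, labeled
```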
|
 |
Simpler Hyperparameter Optimization for Software Analytics: Why, How, When
Amritanshu Agrawal, Xueqi Yang, Rishabh Agrawal, Xipeng Shen and Tim Menzies, IEEE Transactions on Software Engineering
How can we make software analytics simpler and faster? One method is to match the complexity of the analysis to the intrinsic complexity of the data being explored. For example, hyperparameter optimizers find the control settings for data miners that improve the predictions generated via software analytics. Sometimes, very fast hyperparameter optimization can be achieved by DODGE-ing; i.e. simply steering away from settings that lead to similar conclusions. But when is it wise to use that simple approach, and when must we use more complex (and much slower) optimizers? To answer this, we applied hyperparameter optimization to 120 SE data sets that explored bad smell detection, predicting GitHub issue close time, bug report analysis, defect prediction, and dozens of other non-SE problems. We find that the simple DODGE works best for data sets with low intrinsic dimensionality (around 3) and very poorly for higher-dimensional data (around 8). Nearly all the SE data seen here was intrinsically low-dimensional, indicating that DODGE is applicable for many SE analytics tasks.
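A much-simplified sketch of the DODGE idea (not the published implementation): sample configurations, and down-weight any option whose score lands within epsilon of a score already seen, so later samples steer away from redundant settings. The option grid and evaluation function are hypothetical.

```python
# Simplified DODGE-style search (illustrative; not the published implementation).
import random

def dodge(options, evaluate, budget=30, epsilon=0.05):
    weights = {name: {v: 0.0 for v in vals} for name, vals in options.items()}
    seen, best = [], (None, float("-inf"))
    for _ in range(budget):
        # Prefer option values with higher weight; break ties randomly.
        config = {name: max(vals, key=lambda v: (weights[name][v], random.random()))
                  for name, vals in options.items()}
        score = evaluate(config)
        # Down-weight options that only reproduce a result already seen.
        delta = -1.0 if any(abs(score - s) < epsilon for s in seen) else 1.0
        for name, v in config.items():
            weights[name][v] += delta
        seen.append(score)
        if score > best[1]:
            best = (config, score)
    return best

# Hypothetical usage: tune two knobs of a learner with some scoring function my_eval.
# best_config, best_score = dodge({"n_estimators": [10, 50, 100],
#                                  "max_depth": [3, 5, None]}, my_eval)
```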
|
 |
Corporate and personal credit scoring via fuzzy non-kernel SVM with fuzzy within-class scatter
Jian Luo, Xueqi Yang, Ye Tian, Wenwen Yu, Journal of Industrial & Management Optimization
Nowadays, effective credit scoring has become a crucial factor for gaining competitive advantage in the credit market, for both customers and corporations. In this paper, we propose a credit scoring method that combines the non-kernel fuzzy 2-norm quadratic surface SVM model, a T-test feature weighting strategy, and fuzzy within-class scatter. It is worth pointing out that this new method not only saves computational time by avoiding the choice of a kernel and corresponding parameters required by classical SVM models, but also addresses the curse of dimensionality and improves robustness. Besides, we develop an efficient way to calculate the fuzzy membership of each training point by solving a linear programming problem. Finally, we conduct several numerical tests on two benchmark data sets of personal credit and one real-world data set of corporate credit. The numerical results strongly demonstrate that the proposed method outperforms eight state-of-the-art and commonly-used credit scoring methods in terms of accuracy and robustness.
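As a rough Python analogue only (not the model proposed in the paper), a kernel-free quadratic decision surface can be mimicked by expanding features into quadratic monomials and fitting a linear SVM with per-sample fuzzy memberships as weights; the membership rule below is a simple placeholder rather than the LP-based scheme.

```python
# Rough analogue of a kernel-free quadratic-surface SVM with fuzzy memberships.
# The membership rule (distance to the class mean) is a placeholder, not the
# LP-based scheme proposed in the paper.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

def fuzzy_memberships(X, y):
    m = np.empty(len(X))
    for c in np.unique(y):
        idx = y == c
        d = np.linalg.norm(X[idx] - X[idx].mean(axis=0), axis=1)
        m[idx] = 1.0 - d / (d.max() + 1e-12)    # closer to the class mean => larger weight
    return m

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 2).astype(int)   # toy quadratic ground truth

clf = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2), LinearSVC(max_iter=5000))
clf.fit(X, y, linearsvc__sample_weight=fuzzy_memberships(X, y))
print(clf.score(X, y))
```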
|
 |
Coursework Projects (Graduate):
(1) Applied Mask R-CNN with PyTorch to satellite-image change detection and localization; assessed building damage from satellite imagery across a variety of disaster events and damage extents.
(2) Implemented the SmartWeather app in C# with Xamarin and Visual Studio. Used an architecture diagram, a context diagram and quality attribute scenarios in the software design, applied a fuzzy logic controller to convert crisp input values into fuzzy sets with predetermined lower and upper bounds of impreciseness, and followed the Scrum process to iterate on and manage the development.
(3) Implemented word2vec (CBOW and skip-gram) and doc2vec (Doc2Vec and part-of-speech tagging) models in Python 3 on a sentiment analysis dataset and a question answering dataset, and compared the proposed methods against baseline methods (TF-IDF and BOW) in individual projects; a minimal word2vec sketch follows this list.
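For project (3), the CBOW versus skip-gram switch is a single flag in gensim; the toy corpus and hyperparameters below are placeholders (assuming gensim 4.x), not the project's settings.

```python
# Toy CBOW vs. skip-gram comparison with gensim (assumes gensim 4.x).
from gensim.models import Word2Vec

sentences = [["the", "movie", "was", "great"],
             ["the", "film", "was", "terrible"],
             ["what", "a", "great", "film"]]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)   # sg=0: CBOW
skip = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)   # sg=1: skip-gram

print(cbow.wv.most_similar("film", topn=2))
print(skip.wv.most_similar("film", topn=2))
```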
|
 |
Undergraduate Projects (2016-2018):
(1) Credit Scoring via Fuzzy 2-norm Non-kernel Support Vector Machine. Implemented linear SVM, kernel SVM, QSVM and clustered SVM in MATLAB on UCI data sets; a Python analogue is sketched after this list.
(2) Quadratic Surface Support Vector Regression for Electric Load Forecasting. Implemented the LS-SVR and QSSVR models with the interior-point algorithm in MATLAB's "quadprog" module, and the OLS regression and ANN models with MATLAB's "robustfit" module and neural network toolbox, respectively.
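For reference, a Python analogue of the linear versus kernel SVM comparison from project (1); the original was implemented in MATLAB, and scikit-learn's built-in Wisconsin breast cancer data stands in here for the UCI sets.

```python
# Python analogue of the linear vs. kernel SVM comparison (original project used MATLAB).
# The degree-2 polynomial kernel is only a kernelized cousin of the kernel-free QSVM.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
for name, kernel in [("linear SVM", "linear"), ("RBF-kernel SVM", "rbf"),
                     ("degree-2 polynomial SVM", "poly")]:
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, degree=2))
    print(name, cross_val_score(clf, X, y, cv=5).mean().round(3))
```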
|
 |
Mathematical Contest in Modeling (Undergraduate):
(1) Regional Water Supply Stress Index Analysis before and after Intervention: an analysis of a regional water supply problem for the Interdisciplinary Contest in Modeling, 2016.
(2) Allocation of Taxi Resources in the Internet Era: entry for the China Undergraduate Mathematical Contest in Modeling, 2015. I took part in problem analysis, data preprocessing, model construction, algorithm implementation and optimization.
|
 |
Research Intern, Cloud and Infrastructure Security Group, Microsoft Research, May-Aug 2022 (Redmond, WA)
Developed techniques based on recent advances in natural language processing (Longformer, CodeBERT and a learned token pruning algorithm) for detecting, mitigating and preventing modern threats in large-scale cloud environments.
Addressed the quadratic memory requirement of the self-attention mechanism by adapting global and local attention to a PowerShell dataset, as sketched below.
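The sketch below illustrates the global/local attention split with Hugging Face's Longformer; the checkpoint and the PowerShell snippet are placeholders, and the learned-token-pruning step is not shown.

```python
# Longformer sketch: sliding-window (local) attention everywhere, with global attention
# only on the first ([CLS]) token, keeping memory roughly linear in sequence length.
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

script = "Invoke-WebRequest -Uri $url -OutFile $path; Start-Process $path"
inputs = tokenizer(script, return_tensors="pt")

global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1                 # only the first token attends globally

with torch.no_grad():
    outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)          # (batch, sequence length, hidden size)
```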
|
 |
Natural Language Processing, Computational Applied Logic - Fall 2020 (NCSU)
Spatial and Temporal Data Mining, Software Engineering - Spring 2020 (NCSU)
Design and Analysis of Algorithms, Computer Networks - Spring 2019 (NCSU)
Foundations of Software Science, Artificial Intelligence - Fall 2018 (NCSU)
C, Java, Data Mining, Data Structures, JavaScript, MATLAB, SQL, Statistics and Operations Research - Undergraduate
|
 |
IEEE/ACM International Conference on Automated Software Engineering, 2022 (Reviewer)
|
 |
Teaching Assistant - C and Software Tools (CSC 230 002 & 601) - Fall 2018 (NCSU)
Graduate Research Assistant - from Spring 2019 (NCSU)
|
|