Xueqi Yang 杨雪琦

I am a final-year Ph.D. candidate in the RAISE Lab (Real-world Artificial Intelligence for Software Engineering) at North Carolina State University, under the supervision of Dr. Tim Menzies. My research interests include Software Engineering and optimization.

Before coming to NC State, I obtained my bachelor's degree in Information Management and Information Systems (GPA 90/100) from Dongbei University of Finance and Economics in 2018.

Email  /  Resume  /  Google Scholar  /  LinkedIn  /  GitHub

Open to Work: I am on the 2024 job market and would appreciate any support. I am actively looking for an industry position as a Machine Learning/Software/Research Engineer, Data Scientist, or Applied Scientist. If my profile matches your requirements, or if you are curious about how I could contribute to your organization, please do not hesitate to get in touch.

Research

I am currently interested in machine learning optimization. I work on static warning identification, using incremental active learning and human-computer interaction to achieve higher recall at lower cost by exploring fewer irrelevant static warning samples. I also work on feature extraction from programming languages and SE artifacts with embedding methods and NLP approaches. For more information about my research, please visit my Google Scholar.

Publications

SparseCoder: Advancing Source Code Analysis with Sparse Attention and Learned Token Pruning

Xueqi Yang, Mariusz Jakubowski, Kelly Kang, Haojie Yu, Tim Menzies, Under review

As software projects rapidly evolve, software artifacts become more complex and the defects behind them become harder to identify. The emerging Transformer-based approaches, though achieving remarkable performance, struggle with long code sequences because their self-attention mechanism scales quadratically with sequence length. This paper introduces SparseCoder, an approach that incorporates sparse attention and the learned token pruning (LTP) method, adapted from natural language processing, to address this limitation. Extensive experiments on a large-scale vulnerability detection dataset demonstrate the effectiveness and efficiency of SparseCoder, reducing the cost of long code sequence analysis from quadratic to linear relative to CodeBERT and RoBERTa. We further achieve a 50% FLOPs reduction with a negligible performance drop of less than 1% compared to a Transformer using sparse attention alone. Moreover, SparseCoder goes beyond "black-box" decisions by elucidating the rationale behind them: code segments that contribute to the final decision can be highlighted with importance scores, offering an interpretable, transparent analysis tool for the software engineering landscape.
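
To illustrate the token-pruning idea at the heart of LTP, here is a minimal PyTorch sketch (an illustration under assumed shapes, not the SparseCoder implementation; in LTP the threshold is a learned, per-layer parameter rather than the constant used here):

```python
import torch

def prune_tokens(hidden, attn, threshold=0.12):
    """Drop tokens whose significance score falls below a threshold.

    hidden: (batch, seq_len, dim) token representations
    attn:   (batch, heads, seq_len, seq_len) attention probabilities
    A token's significance is its attention averaged over heads and
    query positions, as in learned token pruning.
    """
    score = attn.mean(dim=(1, 2))      # (batch, seq_len)
    keep = score[0] >= threshold       # boolean mask, batch of one
    return hidden[:, keep, :]

# toy usage: prune an 8-token sequence with 4 attention heads
hidden = torch.randn(1, 8, 16)
attn = torch.softmax(torch.randn(1, 4, 8, 8), dim=-1)
print(hidden.shape, "->", prune_tokens(hidden, attn).shape)
```

Since pruned tokens never re-enter later layers, each layer attends over a shorter sequence, which is where the FLOPs savings come from.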

Learning to recognize actionable static code warnings (is intrinsically easy)

Xueqi Yang, Jianfeng Chen, Rahul Yedida, Zhe Yu, Tim Menzies, Empirical Software Engineering

Static code warning tools often generate warnings that programmers ignore. Such tools can be made more useful via data mining algorithms that select the actionable warnings; i.e. the warnings that are usually not ignored. In this paper, we look for actionable warnings within a sample of 5,675 actionable warnings seen in 31,058 static code warnings from FindBugs. We find that data mining algorithms can find actionable warnings with remarkable ease. We implement deep neural networks in Keras and PyTorch on static defect artifacts to predict real defects to act on, use regularizers to keep the DNN models from overfitting and to lower the running overhead, and apply box-counting methods to explore the intrinsic dimension of SE data so that the complexity of the machine learning algorithms matches the datasets they handle.
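
The box-counting notion of intrinsic dimension can be sketched in a few lines of Python (a generic illustration, not the paper's exact procedure): normalize the data to the unit cube, count occupied grid cells at several scales, and fit the slope of log(count) against log(scale).

```python
import numpy as np

def box_counting_dimension(data, scales=(2, 4, 8, 16, 32)):
    """Rough box-counting estimate of intrinsic dimension."""
    data = np.asarray(data, dtype=float)
    mins, maxs = data.min(axis=0), data.max(axis=0)
    data = (data - mins) / np.where(maxs > mins, maxs - mins, 1)
    counts = []
    for s in scales:
        # index of the grid cell (side 1/s) holding each point
        cells = np.floor(data * s).clip(max=s - 1).astype(int)
        counts.append(len({tuple(c) for c in cells}))
    slope, _ = np.polyfit(np.log(scales), np.log(counts), 1)
    return slope

# toy usage: a 1-D line embedded in 3-D should score close to 1
t = np.random.rand(2000, 1)
print(round(box_counting_dimension(np.hstack([t, 2 * t, -t])), 2))
```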

Understanding static code warnings: An incremental AI approach

Xueqi Yang, Zhe Yu, Junjie Wang, Tim Menzies, Expert Systems with Applications

Static code analysis is a widely-used method for detecting bugs and security vulnerabilities in software systems. As software becomes more complex, analysis tools also report lists of increasingly complex warnings that developers need to address on a daily basis. Such static code analysis tools are usually over-cautious; i.e. they often offer many warnings about spurious issues. In this paper, we identify actionable static warnings in nine Java projects generated by FindBugs, using incremental active learning and machine learning algorithms to achieve higher recall at lower cost by reducing false alarms. We use different sampling approaches (random sampling, uncertainty sampling, and certainty sampling) to query the warnings suggested by the active learning algorithm, and a human oracle interacts with the system to update it.
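
The core loop is standard pool-based active learning. A minimal sketch with scikit-learn (synthetic data standing in for warning features; in the paper the oracle is a human inspecting FindBugs warnings):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_query(model, X_pool, n=10):
    """Pick the n pooled items the model is least certain about."""
    prob = model.predict_proba(X_pool)[:, 1]
    return np.argsort(np.abs(prob - 0.5))[:n]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # stand-in "actionable" labels
labeled = list(range(20))                 # small labelled seed set
pool = [i for i in range(500) if i not in labeled]
for _ in range(5):                        # each round: retrain, query oracle
    model = LogisticRegression().fit(X[labeled], y[labeled])
    picked = [pool[i] for i in uncertainty_query(model, X[pool])]
    labeled += picked                     # oracle labels the queried items
    pool = [i for i in pool if i not in picked]
print("accuracy on the remaining pool:", model.score(X[pool], y[pool]))
```

Certainty sampling is the same loop with `np.argsort(-prob)[:n]` as the query rule, i.e. asking the oracle about the items most likely to be actionable.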

Simpler Hyperparameter Optimization for Software Analytics: Why, How, When

Amritanshu Agrawal, Xueqi Yang, Rishabh Agrawal, Xipeng Shen and Tim Menzies, IEEE Transactions on Software Engineering

How can we make software analytics simpler and faster? One method is to match the complexity of the analysis to the intrinsic complexity of the data being explored. For example, hyperparameter optimizers find the control settings for data miners that improve the predictions generated via software analytics. Sometimes, very fast hyperparameter optimization can be achieved by DODGE-ing; i.e. simply steering away from settings that lead to similar conclusions. But when is it wise to use that simple approach, and when must we use more complex (and much slower) optimizers? To answer this, we applied hyperparameter optimization to 120 SE data sets that explored bad smell detection, predicting GitHub issue close time, bug report analysis, defect prediction, and dozens of other non-SE problems. We find that the simple DODGE works best for data sets with low intrinsic dimensionality (around 3) and very poorly for higher-dimensional data (around 8). Nearly all the SE data seen here was intrinsically low-dimensional, indicating that DODGE is applicable for many SE analytics tasks.
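
The "steer away from similar conclusions" idea can be shown with a toy sketch (a simplified reading of DODGE's weighting scheme, not the released tool; the option set, budget, and epsilon below are all assumptions for illustration):

```python
import random

def dodge(options, evaluate, budget=30, eps=0.05):
    """Keep a weight per option; deprecate any option whose result
    lands within eps of a result already seen."""
    weights = {opt: 0.0 for opt in options}
    seen, best = [], None
    for _ in range(budget):
        top = max(weights.values())
        opt = random.choice([o for o, w in weights.items() if w == top])
        score = evaluate(opt)          # assumed to return a value in [0, 1]
        if any(abs(score - s) < eps for s in seen):
            weights[opt] -= 1          # redundant region: steer away
        else:
            weights[opt] += 1
            seen.append(score)
        if best is None or score > best[1]:
            best = (opt, score)
    return best

# toy usage over one discretized hyperparameter
settings = [i / 10 for i in range(11)]
print(dodge(settings, lambda c: 1 - (c - 0.7) ** 2))
```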

Corporate and personal credit scoring via fuzzy non-kernel SVM with fuzzy within-class scatter

Jian Luo, Xueqi Yang, Ye Tian, Wenwen Yu, Journal of Industrial & Management Optimization

Effective credit scoring has become a crucial factor for gaining competitive advantage in the credit market, for both customers and corporations. In this paper, we propose a credit scoring method that combines the non-kernel fuzzy 2-norm quadratic surface SVM model, a T-test feature weighting strategy, and fuzzy within-class scatter. This new method not only saves computational time by avoiding the choice of a kernel and the corresponding parameters required by classical SVM models, but also addresses the curse of dimensionality and improves robustness. Besides, we develop an efficient way to calculate the fuzzy membership of each training point by solving a linear programming problem. Finally, we conduct several numerical tests on two benchmark data sets of personal credit and one real-world data set of corporate credit. The numerical results strongly demonstrate that the proposed method outperforms eight state-of-the-art and commonly-used credit scoring methods in terms of accuracy and robustness.
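
For intuition, a non-kernel quadratic surface classifier scores a point directly with a quadratic function, so no kernel needs to be chosen. A minimal sketch (the W, b, c below are hand-set for illustration; the paper learns them, with fuzzy weighting, from the training data):

```python
import numpy as np

def qsvm_decision(x, W, b, c):
    """Decision value of a quadratic surface classifier:
    f(x) = 0.5 * x^T W x + b^T x + c, with W symmetric."""
    return 0.5 * x @ W @ x + b @ x + c

W = np.array([[2.0, 0.0], [0.0, 1.0]])   # assumed surface parameters
b = np.array([-1.0, 0.5])
c = -0.25
x = np.array([0.3, 0.8])                 # one applicant's two features
print(1 if qsvm_decision(x, W, b, c) >= 0 else -1)
```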

Previous Projects

Coursework Projects (Graduate):

(1) Utilize Mask R-CNN with PyTorch for satellite image change detection and localization. Assess building damage from satellite imagery across a variety of disaster events and damage extents.
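
A minimal PyTorch/torchvision sketch of the Mask R-CNN inference step (illustrative only: weights=None keeps it runnable offline, whereas the project fine-tuned on labelled satellite tiles):

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# weights="DEFAULT" would load COCO-pretrained weights instead
model = maskrcnn_resnet50_fpn(weights=None).eval()
image = torch.rand(3, 512, 512)        # stand-in for a satellite tile
with torch.no_grad():
    out = model([image])[0]            # dict: boxes, labels, scores, masks
keep = out["scores"] > 0.5             # keep confident detections only
print(out["boxes"][keep].shape, out["masks"][keep].shape)
```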

(2) Implement the SmartWeather app in C# with Xamarin and Visual Studio. Use architecture diagrams, context diagrams, and quality attribute scenarios in the software design. Utilize a fuzzy logic controller to convert a crisp input value into a fuzzy set with predetermined lower and upper bounds of impreciseness, and follow the Scrum process to iterate and manage software development.
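
The fuzzification step amounts to a membership function with fixed bounds; a minimal sketch in Python (the app itself was C#, and the set bounds below are made up for illustration):

```python
def triangular_membership(x, lower, peak, upper):
    """Degree to which a crisp value x belongs to a triangular fuzzy
    set bounded by `lower` and `upper`, peaking at `peak`."""
    if x <= lower or x >= upper:
        return 0.0
    if x <= peak:
        return (x - lower) / (peak - lower)
    return (upper - x) / (upper - peak)

# e.g. how strongly 24 degrees C belongs to a "warm" set on [18, 30]
print(triangular_membership(24, lower=18, peak=25, upper=30))
```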

(3) Implement word2vec (CBOW and skip-gram) and doc2vec (Doc2vec and part-of-speech tagging) models in Python 3 on a sentiment analysis dataset and a question answering dataset. Compare the performance of the proposed methods with baseline methods (TF-IDF and BOW) in individual projects.
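
With gensim, the CBOW/skip-gram switch is a single flag; a minimal sketch on a toy corpus (the project used full sentiment analysis and question answering datasets):

```python
from gensim.models import Word2Vec

corpus = [
    ["the", "movie", "was", "great"],
    ["the", "film", "was", "awful"],
    ["i", "loved", "the", "movie"],
]
# sg=1 selects skip-gram; sg=0 (the default) selects CBOW
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)
print(skipgram.wv["movie"].shape)          # one 50-d vector per word
print(cbow.wv.most_similar("movie", topn=2))
```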

Undergraduate Projects (2016-2018):

(1) Credit Scoring via Fuzzy 2-norm Non-kernel Support Vector Machine. Implemented linear SVM, kernel SVM, QSVM, and clustered SVM in MATLAB on UCI data sets.

(2) Quadratic Surface Support Vector Regression for Electric Load Forecasting. Implemented LS-SVR and QSSVR models with the interior-point algorithm in MATLAB's "quadprog" module, and OLS regression and ANN models with MATLAB's "robustfit" module and neural network toolbox, respectively.

Mathematical Contest in Modeling (Undergraduate):

(1) Regional Water Supply Stress Index Analysis before and after Intervention, an analysis of a regional water problem for the Interdisciplinary Contest in Modeling in 2016.

(2) Allocation of taxi resources in the Internet era, the entry for the China Undergraduate Mathematical Contest in Modeling in 2015. I took part in problem analysis, data preprocessing, model construction, algorithm implementation, and optimization.

Work Experience

Research Intern in TAP, Postsubmit (Core) at Google, May-July 2023 (Sunnyvale, CA)

Learned and extracted features from linguistic descriptions of change lists submitted to Google's internal codebase.

Detected breakages and provided high-quality, cost-effective post-submission testing for Google3 (regression test selection and prioritization).

Research Intern in the Cloud and Infrastructure Security Group at Microsoft Research, May-Aug 2022 (Redmond, WA)

Developed techniques based on recent advances in natural language processing (Longformer, CodeBERT, and the learned token pruning algorithm) for detecting, mitigating, and preventing modern threats in large-scale cloud environments.

Addressed the quadratic memory requirement of the self-attention mechanism by adapting global and local attention to a PowerShell dataset.
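
The global/local split is what keeps Longformer's memory linear in sequence length. A minimal sketch with the public Hugging Face checkpoint (the PowerShell line is a stand-in; the internship data is not public):

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

text = "Get-Process | Where-Object { $_.CPU -gt 100 }"   # stand-in script
inputs = tokenizer(text, return_tensors="pt")
# sliding-window (local) attention everywhere; global attention only on
# the first token, so memory grows linearly rather than quadratically
global_mask = torch.zeros_like(inputs["input_ids"])
global_mask[:, 0] = 1
out = model(**inputs, global_attention_mask=global_mask)
print(out.last_hidden_state.shape)
```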

Courses taken

Efficient Deep Learning - Fall 2022 (NCSU)

Artificial Intelligence 2 - Spring 2021 (NCSU)

Natural Language Processing, Computational Applied Logic - Fall 2020 (NCSU)

Spatial and Temporal Data Mining, Software Engineering - Spring 2020 (NCSU)

Design and Analysis of Algorithms, Computer Networks - Spring 2019 (NCSU)

Foundations of Software Science, Artificial Intelligence - Fall 2018 (NCSU)

C, Java, Data Mining, Data Structures, JavaScript, MATLAB, SQL, Statistics and Operations Research - Undergraduate

Professional Service

IEEE/ACM International Conference on Automated Software Engineering, 2022 (Reviewer)

IEEE Internet of Things Journal, 2023, 2024 (Reviewer)

Assistantship Experience

Teaching Assistant - Automated Software Engineering (CSC 591 & 791) - Spring 2024 (NCSU)

Teaching Assistant - Software Engineering (CSC 510) - Fall 2023 (NCSU)

Teaching Assistant - Automated Software Engineering (CSC 591 & 791) - Spring 2023 (NCSU)

Teaching Assistant - C and Software Tools (CSC 230 002 & 601) - Fall 2022 (NCSU)

Graduate Research Assistant - From Spring 2019 to Spring 2022 (NCSU)

Teaching Assistant - C and Software Tools (CSC 230 002 & 601) - Fall 2018 (NCSU)

