DATA MINING

Reasoning over Paths Using Knowledge Bases and Text Information
  • Constructed a Gene-Drug-Disease dataset from raw biomedical sources such as UMLS and PubMed.
  • Built an ensemble model combining LINE (for text) and TransE (for the knowledge graph) to train the embeddings jointly.
  • Developed an interpretable method that combines reinforcement learning models such as MINERVA with conventional text relation extraction methods such as PCNN, using text information to compensate for the sparsity of knowledge bases.
  • Achieved state-of-the-art results on knowledge graph completion tasks and contributed to a research paper published at EMNLP 2019.
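For readers unfamiliar with TransE, the sketch below shows its usual scoring idea: a triple (head, relation, tail) is plausible when head + relation lands close to tail. This is the textbook formulation only, not the joint LINE/TransE model used in the project; the entity names and 2-d vectors are hypothetical and chosen purely for illustration.

```python
import math


def transe_score(head, relation, tail):
    """L2 distance ||h + r - t||; a lower score means a more plausible triple."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def margin_loss(pos_score, neg_score, margin=1.0):
    """Margin ranking loss: a true triple should score at least `margin` lower
    than a corrupted one; the loss is zero once that holds."""
    return max(0.0, margin + pos_score - neg_score)


# Hypothetical embeddings (assumptions, not values from the project).
gene = [1.0, 0.0]
targets = [0.0, 1.0]      # relation vector
drug = [1.0, 1.0]         # true tail: gene + targets == drug
unrelated = [4.0, 5.0]    # corrupted tail

pos = transe_score(gene, targets, drug)
neg = transe_score(gene, targets, unrelated)
loss = margin_loss(pos, neg)
```

In training, the loss would be summed over many true/corrupted pairs and minimized with gradient descent.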

NATURAL LANGUAGE PROCESSING

Generative And Discriminative Persona Classification Model
  • Led a three-member group working on persona identification in natural daily conversations.
  • Built a basic seq2seq model that takes speaker information as an additional input to the decoder, in order to embed persona information in multi-participant open-domain dialogue systems.
  • Extended the basic model to a hierarchical version that embeds paragraph-level meaning as well as persona information.
  • Compared the identification performance of our proposed model against a traditional classifier based on hand-crafted features and a standard discriminative model using a softmax classifier with an LSTM encoder.
  • This is a research project with Zhankui He and Xisen Jin at Fudan.
  • Report of this project is here

Word2Vec and Sentiment Analysis
  • This project uses word2vec models for sentiment analysis and consists of two subtasks: implementing a word2vec model (Skip-gram in this task) to train my own word vectors, and using the average of the word vectors in each sentence as its feature vector to train a classifier (e.g. softmax regression) with gradient descent.
  • Along with implementing the already well-framed code blocks, I spent much of the project improving my code's efficiency and comparing different implementation methods.
  • For the sentiment analysis, I tried different combinations of the context size C, the word-vector dimension dimVectors, and REGULARIZATION to achieve higher accuracy.
  • For training and testing, the development set was divided into a training set and a dev-test set.
  • The best accuracy on the dev set was 29.79%.
  • Report of this project is here.
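The feature construction described above (averaging the word vectors of a sentence, then feeding the result to a softmax classifier) can be sketched as follows. This is a minimal illustration, not the project's code; the two-word vocabulary and its vectors are hypothetical, and skipping out-of-vocabulary tokens is an assumption.

```python
import math


def sentence_feature(tokens, word_vectors):
    """Average the vectors of the tokens found in `word_vectors`.
    Unknown tokens are skipped; an all-zero vector is returned if none are known."""
    dim = len(next(iter(word_vectors.values())))
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical trained vectors (assumption for illustration only).
word_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
feature = sentence_feature(["good", "movie", "unseen"], word_vectors)
probs = softmax([1.0, 1.0])
```

A softmax regression classifier would multiply `feature` by a learned weight matrix to obtain the class scores passed to `softmax`.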

Chinese Event Extraction
  • This project uses sequence labeling with Hidden Markov Models and Conditional Random Fields to extract Chinese events. It consists of two subtasks: trigger labeling (8 types) and argument labeling (35 types).
  • For reading and saving data, I used libraries such as pickle and codecs. For tokenization and part-of-speech tagging in preparation for the CRF toolkit, I chose Jieba.
  • To achieve higher accuracy with the HMM, I tried several smoothing methods and implemented both bigram and trigram models.
  • For training and testing, I divided the Development Set into a Training Set and a Dev-Test Set.
  • The best accuracy with CRF was 71.65% for arguments and 94.68% for triggers; with HMM it was 96.15% for arguments and 71.88% for triggers.
  • Report of this project is here.
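Decoding a bigram HMM for sequence labeling is typically done with the Viterbi algorithm; a minimal sketch follows. It is not the project's implementation: the two labels, the English tokens, and all probabilities are invented for illustration, and no smoothing is applied (unseen emissions simply get probability 0).

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for `observations` under a bigram HMM."""
    # best[t][s] = (probability of the best path ending in state s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical toy model: "O" = outside, "T" = trigger.
states = ["O", "T"]
start_p = {"O": 0.8, "T": 0.2}
trans_p = {"O": {"O": 0.7, "T": 0.3}, "T": {"O": 0.9, "T": 0.1}}
emit_p = {
    "O": {"he": 0.5, "city": 0.4, "attacked": 0.1},
    "T": {"attacked": 0.9, "he": 0.05, "city": 0.05},
}
labels = viterbi(["he", "attacked", "city"], states, start_p, trans_p, emit_p)
```

A trigram model would condition each transition on the two previous labels, which can be handled by running the same recursion over pairs of states.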

Stock Market Prediction
  • This project uses text classification and sentiment analysis to process financial news and predict whether the price of a stock will go up or down.
  • For reading and saving data, I used libraries such as xlrd, pickle and codecs. For tokenization, I chose Jieba.
  • To achieve higher accuracy, I added a financial dictionary to Jieba and removed stop words from the tokenized word list. For feature extraction, I used both positive and negative word dictionaries and considered only the most common words in the news, in order to reduce the feature dimension.
  • For training and testing, I divided the Development Set into a Training Set and a Dev-Test Set, and used cross validation to find the best classifier among Naive Bayes, Decision Tree and Maximum Entropy from nltk, and Bernoulli NB, Logistic Regression, SVC, Linear SVC and NuSVC from sklearn. The best accuracy was 69.5%, achieved with an SVM.
  • Report of this project is here.
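The feature extraction step described above (stop-word removal, a vocabulary limited to the most common words, plus counts from positive and negative word dictionaries) might look roughly like this. The function names, feature names, and English tokens are hypothetical; the input is assumed to be already tokenized. The resulting dictionaries have the featureset shape that nltk classifiers accept.

```python
from collections import Counter


def build_vocabulary(tokenized_docs, stop_words, top_k):
    """Keep the `top_k` most frequent non-stop-word tokens across all documents."""
    counts = Counter(t for doc in tokenized_docs for t in doc if t not in stop_words)
    return [word for word, _ in counts.most_common(top_k)]


def extract_features(tokens, vocabulary, positive_words, negative_words):
    """Binary presence features for vocabulary words plus sentiment-dictionary counts."""
    present = set(tokens)
    features = {f"contains({w})": w in present for w in vocabulary}
    features["positive_count"] = sum(t in positive_words for t in tokens)
    features["negative_count"] = sum(t in negative_words for t in tokens)
    return features


# Hypothetical tokenized headlines (assumption for illustration only).
docs = [["profit", "rises", "the"], ["profit", "falls", "the"], ["loss", "widens"]]
vocabulary = build_vocabulary(docs, stop_words={"the"}, top_k=1)
features = extract_features(
    ["profit", "rises"], vocabulary,
    positive_words={"profit", "rises"}, negative_words={"falls", "loss"},
)
```

Capping the vocabulary at `top_k` words is what keeps the feature dimension small.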

Spell Correction
  • This project performs spell correction using a language model and a channel model.
  • Selection Mechanism: We choose the candidate with the highest probability.
  • Language Model: P(c) is the probability that c appears as a word of English text. We use Chain Rule and Markov Assumption to compute it.
  • Candidate Model: We use Edit Distance to find which candidate corrections, c, to consider.
  • Channel Model: P(w|c) is the probability that w would be typed in a text when the author meant c.
  • Report of this project is here.
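The four components above fit together as argmax over candidates c of P(c) * P(w|c). The sketch below is a deliberately simplified version, not the project's code: it assumes a unigram language model instead of the chain-rule n-gram model, restricts candidates to edit distance 1, and treats the channel model as uniform over those candidates, so the choice reduces to the most frequent candidate. The word counts are hypothetical.

```python
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:] for left, right in splits if right for ch in LETTERS]
    inserts = [left + ch + right for left, right in splits for ch in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Return the most probable known word within one edit, or `word` itself."""
    if word in counts:
        return word
    total = sum(counts.values())
    candidates = [c for c in edits1(word) if c in counts]
    if not candidates:
        return word
    # Unigram P(c); the channel model P(w|c) is assumed uniform here.
    return max(candidates, key=lambda c: counts[c] / total)


# Hypothetical corpus counts (assumption for illustration only).
counts = Counter({"the": 50, "they": 10, "then": 5})
fixed_transpose = correct("teh", counts)
fixed_choice = correct("thee", counts)   # candidates: the, they, then
unchanged = correct("xyz", counts)
```

A fuller version would estimate P(w|c) from observed typing errors and score candidates in their sentence context.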

2017 Microsoft Beauty of Programming
  • Built a chatbot with Microsoft's Bot Framework and the LUIS platform for a document-based question answering task, using many intuitive extracted features.
  • Ranked 16th among 1,200 candidates in the nationwide competition.

ARTIFICIAL INTELLIGENCE

Pac Man Search
  • Adapted from the Berkeley Pac-Man Assignments originally created by John DeNero and Dan Klein.
  • This project designs an intelligent Pacman agent that finds optimal paths through its maze world, both for reaching particular locations (e.g., finding all the corners) and for eating all the dots in as few steps as possible.
  • It consists of two subtasks: implementing the graph search algorithms DFS, BFS, UCS and A*, and using the search criteria outlined in the lectures to design effective heuristics.
  • Report of this project is here.
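As a generic illustration of A* with an admissible heuristic, here is a small grid version using Manhattan distance. It is a sketch under simplifying assumptions (unit step costs, 4-connected moves, a hypothetical 3x3 maze) and deliberately does not use the Berkeley project's classes or API.

```python
import heapq


def a_star(start, goal, walls, width, height):
    """Return the length of a shortest 4-connected path from start to goal, or None."""
    def heuristic(pos):
        # Manhattan distance never overestimates on a unit-cost grid.
        return abs(pos[0] - goal[0]) + abs(pos[1] - goal[1])

    frontier = [(heuristic(start), 0, start)]   # entries are (f = g + h, g, position)
    best_cost = {start: 0}
    while frontier:
        _, cost, pos = heapq.heappop(frontier)
        if pos == goal:
            return cost
        if cost > best_cost[pos]:
            continue                            # stale queue entry
        x, y = pos
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = 0 <= nxt[0] < width and 0 <= nxt[1] < height
            if not inside or nxt in walls:
                continue
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(frontier, (new_cost + heuristic(nxt), new_cost, nxt))
    return None


# Hypothetical maze: a wall column with a gap at the top row (y == 2).
steps = a_star((0, 0), (2, 0), walls={(1, 0), (1, 1)}, width=3, height=3)
blocked = a_star((0, 0), (2, 0), walls={(1, 0), (1, 1), (1, 2)}, width=3, height=3)
```

With the heuristic fixed at zero, the same loop behaves like UCS, which is a convenient way to compare the two.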

Gomoku AI Agent
  • Gomoku, also called five in a row, is a board game which originates from Japan. It is played on a Go board, typically of size 15x15. Players take turns placing pieces until one player has connected 5 in a row.
  • We aimed to develop a smart agent for Gomoku, and our team used 3 main strategies to implement it: Threat-Space Search, Minimax Search, and Monte Carlo Tree Search (MCTS).
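All three search strategies need a terminal test for the winning condition described above. A minimal sketch is shown here; the board encoding (0 for empty, 1 and 2 for the players) is an assumption, and lines longer than five are also counted as wins, which some rule variants disallow.

```python
def has_five(board, player):
    """Return True if `player` has at least five consecutive pieces in any direction."""
    size = len(board)
    directions = ((0, 1), (1, 0), (1, 1), (1, -1))  # row, column, two diagonals
    for r in range(size):
        for c in range(size):
            if board[r][c] != player:
                continue
            for dr, dc in directions:
                end_r, end_c = r + 4 * dr, c + 4 * dc
                if not (0 <= end_r < size and 0 <= end_c < size):
                    continue
                if all(board[r + k * dr][c + k * dc] == player for k in range(5)):
                    return True
    return False


# Hypothetical position: player 1 holds five in a row on row 7.
board = [[0] * 15 for _ in range(15)]
for col in range(3, 8):
    board[7][col] = 1
row_win = has_five(board, 1)
no_win = has_five(board, 2)
```

In a Minimax or MCTS agent, this check would typically run after each simulated move to detect terminal states.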

COMPUTER VISION

Authentication of Paintings and Style Transfer
  • Introduced two feature extraction methods (a geometric tight frame with three statistics, and a style representation derived from a pre-trained VGG network) and applied a forward feature selection algorithm for the authentication task.
  • Implemented a VGG network that extracts content and style features, and improved it for the style transfer task by preprocessing the content image with contour extraction and edge enhancement.
  • Report of this project is here. (Notice: This file includes huge pictures, hence it might be a little bit slow to load. Please be patient :)
  • Detailed deep convolutional neural network
  • Other artistic style transfer demo
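In neural style transfer, the style representation taken from VGG feature maps is commonly the Gram matrix of channel activations. Whether the project used exactly this form is an assumption; the sketch below only illustrates the usual construction, with tiny hypothetical feature maps and normalization by the number of spatial positions as one of several conventions.

```python
def gram_matrix(features):
    """Gram matrix of feature maps.
    `features` is a list of C channels, each a flat list of H*W activations.
    Entry (i, j) is the dot product of channels i and j, divided by H*W."""
    positions = len(features[0])
    return [
        [sum(a * b for a, b in zip(fi, fj)) / positions for fj in features]
        for fi in features
    ]


def style_loss(gram_a, gram_b):
    """Mean squared difference between two Gram matrices of the same shape."""
    diffs = [(x - y) ** 2 for row_a, row_b in zip(gram_a, gram_b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


# Hypothetical 2-channel feature maps with 2 spatial positions each.
style_features = [[1.0, 2.0], [3.0, 4.0]]
gram = gram_matrix(style_features)
same = style_loss(gram, gram)
```

Because the Gram matrix discards spatial layout and keeps only channel correlations, it captures texture-like style rather than content.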