Text to features for Swedish text

In text mining, texts are usually transformed into numerical feature vectors before they are given to a machine learning algorithm for text classification. In this project, a set of features for classifying tweets in Swedish was created. The following classification tasks were selected: gender prediction, age prediction, political party prediction, sentiment analysis, and authorship attribution, which is the task of determining whether a text was written by a particular author. Relevant previous studies were reviewed, and a suitable subset of the features used in those studies was chosen. A tool was developed that preprocesses the tweets and calculates, for each tweet, a value for every feature in the feature set. Experiments were run on a data set consisting of tweets written by Swedish politicians. The output of the tool was given to a machine learning algorithm that created classification models. While the first four classification tasks were unsuccessful, some of the authorship attribution models achieved an F-score between 80% and 90%. For the failed classification tasks, the features need to be tested on a different data set, or new features have to be created.
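As a rough illustration of what turning a tweet into a feature vector can look like, the following minimal Python sketch computes a handful of simple surface features. The feature names, the regular expressions, and the helper functions `tweet_features` and `to_vector` are assumptions made for this example; they are not the feature set or the tool developed in the project.

```python
import re

# Hypothetical patterns for common tweet elements (assumed for this sketch).
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Runs of Unicode letters, so Swedish characters such as å, ä and ö are kept.
WORD_RE = re.compile(r"[^\W\d_]+")

# Fixed feature order so that every tweet maps to a vector of the same layout.
FEATURE_ORDER = [
    "n_chars",
    "n_words",
    "avg_word_len",
    "n_urls",
    "n_mentions",
    "n_hashtags",
    "n_exclamations",
]


def tweet_features(text):
    """Return a dict of illustrative surface features for one tweet."""
    urls = URL_RE.findall(text)
    mentions = MENTION_RE.findall(text)
    hashtags = HASHTAG_RE.findall(text)

    # Preprocessing: drop URLs, mentions and hashtags before counting words.
    stripped = URL_RE.sub(" ", text)
    stripped = MENTION_RE.sub(" ", stripped)
    stripped = HASHTAG_RE.sub(" ", stripped)
    words = WORD_RE.findall(stripped.lower())

    n_words = len(words)
    avg_word_len = sum(len(w) for w in words) / n_words if n_words else 0.0

    return {
        "n_chars": len(text),
        "n_words": n_words,
        "avg_word_len": avg_word_len,
        "n_urls": len(urls),
        "n_mentions": len(mentions),
        "n_hashtags": len(hashtags),
        "n_exclamations": text.count("!"),
    }


def to_vector(features):
    """Flatten the feature dict into a list of floats in FEATURE_ORDER."""
    return [float(features[name]) for name in FEATURE_ORDER]


if __name__ == "__main__":
    example = "Vi behöver fler lärare! #svpol @example https://example.com"
    print(to_vector(tweet_features(example)))
```

Vectors produced this way, one per tweet, could then be passed together with their labels to any standard classifier. Keeping the feature order fixed is what makes the vectors comparable across tweets.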

DiVA