Source code for Multi-Annotator Supervised LDA for regression (MA-sLDAr) released

MA-sLDAr is a C++ implementation of the supervised topic models with responses/target variables provided by multiple annotators with different levels of expertise, as proposed in:

For more details click here.

Marie Curie COFUND grant

I have been awarded a 2-year H.C. Ørsted Postdoc – COFUND grant. The H.C. Ørsted Postdoc programme is co-funded by Marie Curie Actions. The programme is named after Hans Christian Ørsted, discoverer of electromagnetism and founder of the Technical University of Denmark (DTU). COFUNDPostdocDTU pursues the goals of Marie Curie COFUND by increasing Europe-wide mobility opportunities for the training and career development of experienced researchers. More information about the programme can be found here.

Source code for Multi-Annotator Supervised LDA for classification (MA-sLDAc) released

Julia code for LogReg-Crowds released

LogReg-Crowds is a collection of Julia implementations of various approaches for learning a logistic regression model from multiple annotators and crowds, namely the works of:

  • Raykar, V., Yu, S., Zhao, L., Valadez, G., Florin, C., Bogoni, L., and Moy, L. Learning from Crowds. Journal of Machine Learning Research, pp. 1297–1322, 2010.
  • Dawid, A. P. and Skene, A. M. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society. Series C, 28(1):20–28, 1979.

All implementations handle multi-class problems and do not require repeated labelling (i.e. annotators do not have to provide labels for the entire dataset). The code was written with readability in mind and is well commented, so it should be easy to use (see the file “demo.jl”). At the same time, the Julia language gives it strong performance, especially when compared to other scientific languages such as MATLAB or Python/NumPy, without compromising its high-level, readable style.
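To illustrate the kind of estimation involved, here is a minimal Python sketch of the Dawid and Skene EM procedure for multi-class labels where each annotator labels only a subset of the items. It is not taken from LogReg-Crowds (which is written in Julia and also learns a logistic regression model); the function names `dawid_skene` and `map_labels`, the `smoothing` parameter and the input format are assumptions made for this example.

```python
def dawid_skene(labels, n_classes, n_iter=50, smoothing=0.01):
    """Estimate true-label posteriors and per-annotator confusion matrices.

    labels: list of (item, annotator, label) triples; an annotator may
    label any subset of the items (no repeated labelling required).
    """
    items = sorted({i for i, _, _ in labels})
    annotators = sorted({a for _, a, _ in labels})

    # Initialise each item's posterior with its vote proportions.
    post = {i: [0.0] * n_classes for i in items}
    for i, _, l in labels:
        post[i][l] += 1.0
    for i in items:
        total = sum(post[i])
        post[i] = [v / total for v in post[i]]

    conf = {}
    for _ in range(n_iter):
        # M-step: class prior from the current posteriors.
        prior = [smoothing] * n_classes
        for i in items:
            for k in range(n_classes):
                prior[k] += post[i][k]
        z = sum(prior)
        prior = [p / z for p in prior]

        # M-step: conf[a][k][l] = P(annotator a answers l | true class k).
        conf = {a: [[smoothing] * n_classes for _ in range(n_classes)]
                for a in annotators}
        for i, a, l in labels:
            for k in range(n_classes):
                conf[a][k][l] += post[i][k]
        for a in annotators:
            for k in range(n_classes):
                row_sum = sum(conf[a][k])
                conf[a][k] = [v / row_sum for v in conf[a][k]]

        # E-step: posterior over the true class of every item.
        new_post = {i: list(prior) for i in items}
        for i, a, l in labels:
            for k in range(n_classes):
                new_post[i][k] *= conf[a][k][l]
        for i in items:
            total = sum(new_post[i])
            post[i] = [v / total for v in new_post[i]]

    return post, conf


def map_labels(post):
    """Most probable class for each item."""
    result = {}
    for item, probs in post.items():
        result[item] = max(range(len(probs)), key=lambda k: probs[k])
    return result
```

A call such as `post, conf = dawid_skene(triples, n_classes=2)` returns one posterior per item and one confusion matrix per annotator; rows of a confusion matrix that sit far from the diagonal point to an unreliable annotator.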

The tar.gz with the source code can be obtained here.

Source code for GPC-MA released

GPC-MA builds on top of the popular GPML Matlab toolkit for Gaussian processes, adding support for data from multiple annotators and crowds. This allows it to estimate the reliability of the different annotators and to find better estimates of the (unobserved) ground truth labels than standard GP classification or majority-voting-based approaches. See the original paper for further details:

Rodrigues, F. and Pereira, F.C. and Ribeiro, B., Gaussian Process Classification and Active Learning with Multiple Annotators, in proceedings of the International Conference on Machine Learning (ICML), 2014.

The tar.gz with the source code can be obtained here.

The datasets (from Amazon’s Mechanical Turk) used in the paper are also available here for download.

Gaussian Process Classification and Active Learning with Multiple Annotators

Rodrigues, F. and Pereira, F.C. and Ribeiro, B.
In Proceedings of the International Conference on Machine Learning (ICML), 2014

Abstract: Learning from multiple annotators took a valuable step towards modeling data that does not fit the usual single annotator setting, since multiple annotators sometimes offer varying degrees of expertise. When disagreements occur, the establishment of the correct label through trivial solutions such as majority voting may not be adequate, since without considering heterogeneity in the annotators, we risk generating a flawed model.
In this paper, we generalize GP classification in order to account for multiple annotators with different levels of expertise. By explicitly handling uncertainty, Gaussian processes (GPs) provide a natural framework for building proper multiple-annotator models. We empirically show that our model significantly outperforms other commonly used approaches, such as majority voting, without a significant increase in the computational cost of approximate Bayesian inference. Furthermore, an active learning methodology is proposed, which is able to reduce annotation cost even further.
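The point about majority voting can be seen in a toy example. The Python sketch below is not the GP model from the paper: it assumes binary labels, independent annotators, a uniform class prior, and annotator accuracies that are already known (in the paper's setting they have to be estimated). Under those assumptions, weighting each vote by the log-odds of the annotator's accuracy gives the most probable label. The names `majority_vote`, `weighted_vote` and the accuracy values are hypothetical.

```python
import math
from collections import Counter


def majority_vote(votes):
    """votes: dict annotator -> label. Ties go to the smallest label."""
    counts = Counter(votes.values())
    return max(sorted(counts), key=lambda label: counts[label])


def weighted_vote(votes, accuracy):
    """Binary labels in {0, 1}; accuracy[a] must lie strictly in (0, 1).

    Each vote is weighted by log(p / (1 - p)), where p is the
    (assumed known) probability that the annotator is correct.
    """
    score = 0.0
    for annotator, label in votes.items():
        p = accuracy[annotator]
        weight = math.log(p / (1.0 - p))
        score += weight if label == 1 else -weight
    return 1 if score > 0 else 0


# One reliable annotator is outvoted by two near-random ones.
votes = {"expert": 1, "noisy_1": 0, "noisy_2": 0}
accuracy = {"expert": 0.95, "noisy_1": 0.55, "noisy_2": 0.55}

print("majority vote:", majority_vote(votes))
print("weighted vote:", weighted_vote(votes, accuracy))
```

Here the two near-random annotators win the majority vote, while the weighted vote follows the reliable annotator.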

Source code for CRF-MA released

CRF-MA is an extension of the Java implementation of Conditional Random Fields (CRFs) available in the Mallet toolbox in order to handle multiple annotators. CRF-MA uses the Expectation-Maximization algorithm to jointly learn the CRF model parameters, the reliability of the annotators and the estimated ground truth. When it comes to performance, the proposed method (CRF-MA) significantly outperforms typical approaches such as majority voting. See the original paper for further details:

Rodrigues, F. and Pereira, F.C. and Ribeiro, B., Sequence labeling with multiple annotators, Machine Learning, Springer, 2013.

Download here.

Sequence labeling with multiple annotators

Rodrigues, F. and Pereira, F.C. and Ribeiro, B.
Machine Learning, Springer, 2013

Abstract: The increasingly popular use of Crowdsourcing as a resource to obtain labeled data has been contributing to the wide awareness of the machine learning community to the problem of supervised learning from multiple annotators. Several approaches have been proposed to deal with this issue, but they disregard sequence labeling problems. However, these are very common, for example, among the Natural Language Processing and Bioinformatics communities. In this paper, we present a probabilistic approach for sequence labeling using Conditional Random Fields (CRF) for situations where label sequences from multiple annotators are available but there is no actual ground truth. The approach uses the Expectation-Maximization algorithm to jointly learn the CRF model parameters, the reliability of the annotators and the estimated ground truth. When it comes to performance, the proposed method (CRF-MA) significantly outperforms typical approaches such as majority voting.

Learning from Multiple Annotators: Distinguishing Good from Random Labelers

Rodrigues, F. and Pereira, F.C. and Ribeiro, B.
Pattern Recognition Letters, Elsevier, 2013

Abstract: With the increasing popularity of online crowdsourcing platforms such as Amazon Mechanical Turk (AMT), building supervised learning models for datasets with multiple annotators is receiving increasing attention from researchers. These platforms provide an inexpensive and accessible resource that can be used to obtain labeled data, and in many situations the quality of the labels competes directly with those of experts. For such reasons, much attention has recently been given to annotator-aware models. In this paper, we propose a new probabilistic model for supervised learning with multiple annotators where the reliability of the different annotators is treated as a latent variable. We empirically show that this model is able to achieve state of the art performance, while reducing the number of model parameters, thus avoiding a potential overfitting. Furthermore, the proposed model is easier to implement and extend to other classes of learning problems such as sequence labeling tasks.

Text analysis in incident duration prediction

Pereira, F.C. and Rodrigues, F. and Ben-Akiva, M.
Transportation Research Part C, Elsevier, 2013

Abstract: Due to their heterogeneous case-by-case nature, plenty of relevant information about traffic incidents is communicated in free flow text fields instead of constrained value fields. As a result, such text components enclose considerable richness that is invaluable for incident analysis, modeling and prediction. However, the difficulty to formally interpret such data has led to minimal consideration in previous work.

This paper proposes the use of topic modeling, a text analysis technique, in the problem of incident duration prediction. We analyze a dataset of 2 years of accident cases and develop a duration prediction model that considers both textual and non-textual features. To demonstrate the value of the approach, we compare predictions with and without text analysis using several different prediction models.
