# Workshop on Recent Trends in Machine Learning and Data Mining

Thursday 22nd September, 2011

**Sala d'Actes de la FIB**

*Conference Hall of the Barcelona School of Informatics*on floor 0 of building B6 Campus Nord UPC.

Contacts: Albert Bifet (abifet AT cs.waikato.ac.nz), Ricard Gavaldà (gavalda AT lsi.upc.edu)

**09:25 09:30**Welcome

**09:30 10:00**Joao Gama (U. Porto) Distributed clustering from data streams

**10:00 10:30**Eibe Frank (U. Waikato) Experiments with Randomisation and Boosting for Multi-instance Classification

**10:30 11:00**Gemma C. Garriga (INRIA Lille) A discussion on sampling graphs to approximate network classification functions

**11:00 12:00**-Break; discussions and networking encouraged

**12:00 12:30**Ariadna Quattoni (UPC Barcelona) Spectral Learning Methods for Finite State Machines with Applications to Natural Language Tagging and Parsing

**12:30 01:00**Indrė Žliobaitė (U. Bornemouth) Adaptive pre-processing for streaming data

**01:00 01:30**Geoff Holmes (U. Waikato) Machine Learning Application Development

**01:30 03:00**-Lunch (on your own)

**03:00 03:30**Anton Dries (U. Pompeu Fabra) A query language for analyzing networks

**03:30 04:00**Bernhard Pfahringer (U. Waikato) Semi-random model tree ensembles: an effective and scalable

regression method

**04:00 04:30**Aris Gionis (Yahoo! Research Labs) Overlapping correlation clustering

**04:30 05:00**Discussion and more networking

**Distributed clustering from data streams**

Joao Gama (U. Porto)

Simple objects that surround us are gaining sensors, computational power, and actuators, and are changing from static, into adaptive and reactive systems. In this talk we discuss issues for knowledge discovery from distributed data streams generated by sensors with limited computational resources.

We present two clustering algorithms for two different tasks: clustering streaming data, which searches for dense regions of the data space, and clustering streaming data sources, which finds groups of sources that behave similarly over time. In the ﬁrst setting, a cluster is deﬁned to be a set of data points. In the second setting, a cluster is deﬁned to be a set of sensors. We conclude the talk by presenting the lessons learned.

**Experiments with Randomisation and Boosting for Multi-instance Classification**

Eibe Frank (U. Waikato)

A fairly recent development in the WEKA software has been the addition of algorithms for multi-instance classification, in particular, methods for ensemble learning. Ensemble classification is a well-known approach for obtaining highly accurate classifiers for single-instance data. This talk will first discuss how randomisation can be applied to multi-instance data by adapting Blockeel et al.'s multi-instance tree inducer to form an ensemble classifier, and then investigate how Maron's diverse density learning method can be used as a weak classifier to form an ensemble using boosting. Experimental results show the benefit of ensemble learning in both cases.

**A discussion on sampling graphs to approximate network classification functions**

Gemma C. Garriga (INRIA Lille)

The problem of network classification consists on assigning a finite set of labels to the nodes of the graphs; the underlying assumption is that nodes with the same label tend to be connected via strong paths in the graph. This is similar to the assumptions made by semi-supervised learning algorithms based on graphs, which build an artificial graph from vectorial data. Such semi-supervised algorithms are based on label propagation principles and their accuracy heavily relies on the structure (presence of edges) in the graph.

In this talk I will discuss ideas of how to perform sampling in the network graph, thus sparsifying the structure in order to apply semi-supervised algorithms and compute efficiently the classification function on the network. I will show very preliminary experiments indicating that the sampling technique has an important effect on the final results and discuss open theoretical and practical questions that are to be solved yet.

**Spectral Learning Methods for Finite State Machines with Applications to Natural Language Tagging and Parsing**

Ariadna Quattoni (UPC Barcelona)

**Adaptive pre-processing for streaming data**

Indrė Žliobaitė (U. Bornemouth)

Many supervised learning approaches that adapt to changes in data distribution over time (e.g. concept drift) have been developed. The majority of them assume that data comes already pre-processed or that pre-processing is an integral part of a learning algorithm. In real application tasks data that comes from, e.g. sensor readings, is typically noisy, contains missing values, redundant features and a large part of model training needs to be devoted to data cleaning and pre-processing. As data is evolving over time, not only learning models, but also pre-processing mechanisms need to adapt. We will discuss under what circumstances it is beneficial to handle adaptivity of pre-processing and adaptivity of the learning model separately.

**Machine Learning Application Development**

Geoff Holmes (U. Waikato)

In this talk I will review several real-world applications and tools developed at the University of Waikato over the past 15 years. The early applications focused on agricultural problems such as cow culling, venison bruising and grass grubs. Following this we looked at the use of near infrared spectroscopy coupled with data mining as an alternate laboratory technique for predicting compound concentrations in soil and plant samples. Our latest application is in the area of gas chromatography mass spectrometry (GCMS), a technique used to determine in environmental applications, for example, the petroleum content in soil and water samples.

**A query language for analyzing networks**

Anton Dries (U. Pompeu Fabra)

Information networks are a popular way to represent information, especially in domains where the emphasis lies on the structural relationships between the entities rather than their features. Notable examples are online social networks and road networks. This special focus on network topology has led to the development of specialized graph databases. However, few of these databases offer a high-level declarative interface suited for analyzing information networks.

In this talk I present our work on developing a query language for analyzing networks. I will focus on the general principles we followed in the design of this language, and the main challenges related to developing it into a scalable tool for network analysis.

**Semi-random model tree ensembles: an effective and scalable regression method**

Bernhard Pfahringer (U. Waikato)

We present and investigate ensembles of semi-random model trees as a novel regression method. Such ensembles combine the scalability of tree-based methods with predictive performance rivalling the state of the art in numeric prediction. An empirical investigation shows that Semi-Random Model Trees produce predictive performance which is competitive with state-of-the-art methods like Gaussian Processes Regression or Additive Groves of Regression Trees. The training and optimization of Random Model Trees scales better than Gaussian Processes Regression to larger datasets, and enjoys a constant advantage over Additive Groves of the order of one to two orders of magnitude.

**Overlapping correlation clustering**

Aris Gionis (Yahoo! Labs)

Overlapping clustering, where a data point can be assigned to more than one cluster, is desirable in various applications, such as bioinformatics, information retrieval, and social network analysis. In this paper we generalize the framework of correlation clustering to deal with overlapping clusters. In short, we formulate an optimization problem in which each point in the dataset is mapped to a small set of labels, representing membership in different clusters. The number of labels does not have to be the same for all data points. The objective is to find a mapping so that the distances between points in the dataset agree as much as possible with distances taken over their sets of labels. For defining distances between sets of labels, we consider two measures: set-intersection indicator and the Jaccard coefficient.

To solve the problem we propose a local-search algorithm. Iterative improvement within our algorithm gives rise to non-trivial optimization problems, which, for the measures of set intersection and Jaccard, we solve using a greedy method and non-negative least squares, respectively.