# Seminar

Please send email to larca cs.upc.edu to be included in the seminar announcement list.

Students are welcome at any time!

Speaker: Irena Koprinska, University of Sidney

Time: Oct 23rd, 2012

Place: A4014, building A4, campus nord

Online dating websites are used by millions of people. This talk will explore currentresearch at the University of Sydney, Computer Human Adaptive Interaction (CHAI)research lab, to build recommender systems for online dating, in collaboration with amajor Australian dating website. We will first show that similar people, as definedby a set of personal attributes, like and dislike similar people and are liked anddisliked by similar people. This analysis provides the foundation for ourcontent-collaborative recommender approach. We will then present a study of theimplicit and explicit user preferences. The explicit preferences are stated by theuser while the implicit preferences are inferred based on the user behavior on thewebsite using a machine learning method. We will discuss the effectiveness of theimplicit and explicit user preferences in predicting the success of user interactions.

Bio:

Irena Koprinska is a Senior Lecturer at the University of Sydney, visiting LARCA aspart of her sabbatical. Irena's research interests are in Machine Learning, DataMining and Neural Networks, and their applications for Pattern Recognition andRecommender Systems. For more information see Irena's web page:http://rp-www.cs.usyd.edu.au/~irena/

Time: Feb 23rd 2010, 10:00

Place: S208, floor -2, Omega Building.

Abstract: My application is recognition of emotions from speech. My contribution is a novel classification method.

I. the TGI+ classifier:

I adapted a philosophy of combining decision trees and tree automata, which originally comes from optical character recognition. The application to a new domain of speech emotions instead of optical character recognition requested devising new algorithms for the tree grammar inference part. The resulting classification algorithm incorporates statistical and syntactic learning. The syntactic part implements a tree grammar inference algorithm. The statistical part implements the entropy decision tree classifier C4.5.

I have further extended the idea with an built-in feature selection procedure. I tested the classifier on a benchmark data set of ACTED EMOTIONS. The proposed classifier outperformed a state of the art classifier, the multilayer perceptron, with a statistically significant difference in accuracies of 4.68% and the baseline of the C4.5, by 26.58%.

The main property of the TGI+ is human-readability of the classification process, which is of a potential application for example in the clinical context as an assessment and training tool for patients with impaired capabilities to express speech emotions.

II. further evolution to face big and noisy data sets:

I propose the high-level features, which are the distances from a feature vector to a tree automaton accepting class i, for all i in the set of the class labels. I propose to early-fuse the set of the low-level features and the set of the high-level features, submit the resulting set to a feature selection procedure and then do the classification step with the RIPPER-k classifier, which is, like the C4.5, a decision tree classifier yet more robust on big and noisy data sets. The proposed classification scheme outperformed the state of the art top performer (the SVM) with a statistically significant difference in accuracies of 7% on AUTHENTIC EMOTIONS.

Acted and authentic speech emotions are both important for practical applications, yet have been shown to be very different pattern recognition tasks.

Time: Feb 22nd 2010, 12:00

Place: Room S208, Omega Building

Summary: Most techniques for images classification are based on numerical vectors in high dimensional spaces. However, such representations may be difficult to extract in the case of small thumbnail images for instance. In this case, we claim that discrete representations, based on strings, trees or graphs are still challenging. In this talk, we model images with open plane graphs. We show that isomorphism and subisomorphism problems, known to be intractable in general, can be solved in quadratic time on this kind of graphs. Moreover, we use our algorithms to retrieve patterns within databases containing thousands of images. Finally, we shall give some hints concerning the development of new distances, based on the largest common subgraph, or on edit operations, that would allow one to classify

images using nearest-neighbour techniques.

Speaker: Xavier Carreras, LSI UPC

Time: feb 9th 2010, 15:30

Place: S208, floor -2, Omega Building.

Abstract:

Syntactic parsing is the fundamental problem of determining the structure of natural language sentences. It is a challenging task, because syntactic structures of natural languages are recursive, and there is a significant degree of ambiguity in determining how different parts of a sentence combine together syntactically.

In any computational model for parsing, the choice of grammar formalism is critical to both the representational power of the model and its computational efficiency. In this talk I will describe a variant of a Tree Adjoining Grammar (TAG) that can use a wide varitey of rich features and, at the same time, has efficient algorithms.

I will present two applications of our TAG. The first is a discriminative parser, a generalization of Conditional Random Fields for sequence prediction that extends the framework to syntactic parsing.

I will then present a system for Machine Translation that frames the task of translation as a parsing task.

No previous NLP knowledge required. This is joint work with Michael Collins and Terry Koo. Speaker: Alfredo Vellido, LSI

Time: dec 1st 2009, 11:05

Place: Room S208, Omega Building

Abstract:

De manera informal, hablaré sobre los proyectos de investigación en los que nuestro grupo, Soft Computing (SOCO), lleva involucrado desde 2006. En el área de biomedicina computacional, trabajamos en el desarrollo de métodos basados en técnicas de inteligencia artificial para ayudar a la toma de decisiones clínico-diagnósticas. No en cualquier área de medicina diagnóstica: en oncología. No en cualquier área de oncología: en neuro-oncología. El diagnóstico de tumores cerebrales es una tarea compleja y humanamente sensible. ¿Podemos ayudar a los expertos médicos mediante el uso de técnicas computacionales para el análisis de datos? Sólo si se dejan. ¿Es probable que se dejen? No. Y aunque se dejen, tengo mis dudas de que realmente podamos ser de gran ayuda. En cualquier caso, me agradaría que los asistentes a la charla me llevaran la contraria. Slides

Speaker: Ariadna Quattoni, LSI

Time: november 24th 2009, 15:30

Room: S208, Omega

Abstract:

Many applications in machine learning can be casted as sequence prediction problems, where given a sequence of input symbols the goal is to predict an output sequence that assigns a label to each input symbol. In natural language processing, for example, the problem of part of speech tagging involves predicting a sequence of part of speech tags for a given sentence (i.e. a sequence of words).

Conditional Random Fields (CRF) have been proposed to solve such sequence prediction problems. A CRF models the distribution of output sequences conditioned on an input sequence as a markov random field.

In this tutorial I will present the model and training algorithms. I will also frame CRFs as an instance of a larger class of factorized linear models for predicting structures. Finally, I will describe some extensions of the CRF model that allow for the incorporation of hidden variables. Slides (pdf).

Speaker: Josep Carmona, LSI, UPC

Date: November 5th, 2009

Time: 10:30

Place: Room S208, floor -2, Omega Building

Abstract:

Process mining is a relatively young area. The main goal is to incorporate the use of formal models and techniques in current systems to make them better. Among all branches, a main one is process discovery: from system traces, one wishes to find a formal model to represent them. This formal model may later be used for very different purposes (visualization, property verification, extension, bottleneck analysis, verification of conformity with the real system, etc.)

In this talk we will introduce modern techniques for formal model discovery, based on the theory of regions introduced in the nineties by Ehrenfeucht and Rozenberg. We will also enumerate some open problems and current research lines in our group.

No previous knowledge is required. Slides (pdf).

Speaker: Antti Ukkonen

Date: 21.05.2009

Time: 11:00-12:00

Place: sala s2-208 del omega

Abstract:

Rankings of items are a useful concept in a variety of applications, such as clickstream analysis, some voting methods, bioinformatics, and other fields of science such as paleontology. We address two problems related to such data. The first problem is about finding orders, while the second one is about analyzing sets of orders.

We consider two different tasks in the problem of finding orders. We can find orders either by computing an aggregate of a set of known orders, or by constructing an order for a previously unordered data set. For the first task we show that bucket orders, a subclass of partial orders, are a useful structure for summarizing sets of orders. We formulate an optimization problem for finding such partial orders, show that it is NP-hard, and give an efficient randomized algorithm for finding approximate solutions to it. Moreover, we show that the expected cost of a solution found by the randomized algorithm differs from the optimal solution only by a constant factor. For the second approach we propose a simple method for sampling orders for 0--1 vectors that is based on the consecutive ones property.

For analyzing orders, we discuss three different methods. First, we give an algorithm for clustering sets of orders. The algorithm is a variant of Lloyd's iteration for solving the k-means problem. We also give two different approaches for mapping orders to vectors in a high-dimensional Euclidean space. These mappings are used on one hand for clustering, and on the other hand for creating two dimensional visualizations (scatterplots) for sets of orders. Finally, we discuss randomization testing in case of orders. To this end we propose an MCMC algorithm for creating random sets of orders that preserve certain well defined properties of a given set of orders. The random data sets can be used to assess the statistical significance of the results obtained e.g. by clustering.

Dr. Antti UKKONEN received his Master's degree in 2004 from Helsinki University of Technology, and his doctoral degree from the same university in 2008. He has been working at Helsinki University of Technology and Helsinki Institute for Information Technology (HIIT) since 2004. His research interests are algorithmic data analysis and information retrieval. Prior to joining HIIT he has worked as a software engineer at Nokia and as a research assistant at Helsinki Institute of Physics. In 2003 he was visiting the European Organization for Nuclear Research (CERN). Currently he is on leave of absence from HIIT to visit Yahoo! Research/Fundacio Barcelona Media/ UPF in Barcelona.

Speaker: Stan Matwin, University of Ottawa

Date: 07.05.2009

Time: 10:00-11:00

Place: sala s2-208 del omega

Abstract:

With the increased sizes of datasets to be mined, it becomes critical to select and keep a smaller but representative instance subset for efficient and high quality data mining. Although diverse techniques have been proposed, an instance selection method that is model-independent and scalable is essential. In this paper, we show how to select instances using a set of random trees, using a search heuristic to minimize the number of selected instances. Our method has linear time complexity with the size of dataset. Empirical studies show that the size of datasets can be significantly reduced, and the reduced data can be used to train various classifiers without significantly sacrificing their prediction power. Finally, the comparisons with other instance selection methods show that the proposed method roughly matches the quality of the data selected by a state-of-the-art active learning method, but works much more efficiently.

Stan Matwin is a Professor of Computer Science at the University of Ottawa,and a Foreign Professor at the Institute of Computer Science, Polish Academy of Sciences. Former President of the Canadian AI Society (CSCSI) and the IPF WG on Machine Learning. His research is in machine learning, data Mining, and their applications. Actively involved in technology transfer between academic labs and industry, he has been received the Ontario Champion of Innovation Award.

Speaker: Josep Roure, Escola Universitària Politècnica de Mataró

Date: 24.03.2009

Time: 10:00-11:00

Place: sala s2-208 del omega

Abstract:

Increasingly, data-mining algorithms must deal with databases that continuously grow over time. These algorithms must avoid repeatedly scanning their databases. When database attributes are symbolic, ADtrees have already shown to be efficient structures to store sufficient statistics in main memory and to accelerate the mining process in batch environments. In this talk we present an efficient method to sequentially update ADtrees that is suitable for incremental environments.

Speaker: Marcel Alcoverro, TSC, UPC

Date: Thursday, jan 22

Time: 10:00-11:00

Place: sala s2-208 del omega

Abstract:

A Hidden Markov Model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters; the challenge is to determine the hidden parameters from the observable data. There are three canonical problems associated with HMM: we present them and the canonical algorithms for their solution.

* Given the parameters of the model, compute the probability of a particular output sequence, and the probabilities of the hidden state values given that output sequence. This problem is solved by the forward-backward algorithm.

* Given the parameters of the model, find the most likely sequence of hidden states that could have generated a given output sequence. This problem is solved by the Viterbi algorithm.

* Given an output sequence or a set of such sequences, find the most likely set of state transition and output probabilities. This problem is solved by the Baum-Welch algorithm.

Speaker: Artur Dubrawski, Carnegie-Mellon University

Time: thursday, jan 8th, 12:00-13:00

Place: sala s2-208, Omega building

Abstract:

Biomedical security is a fundamental requirement for the well being of human communities around the world. Challenges related to maintaining it have recently motivated numerous research activities aimed at supporting detection of disease outbreaks in epidemiology, at identifying new patterns of antibiotic resistance of bacteria in microbiology, or at investigating virulence and pathways of development of food-borne illnesses in food safety, to mention just a few. Some of the challenges can be at least partially addressed via analysis of the available data. This talk reviews research into and applications of the data-driven analytic methodologies designed to support maintenance of biomedical security, currently performed at the Carnegie Mellon University Auton Lab.

Bio:

Dr. Artur Dubrawski is the Director of the Auton Lab and Systems Scientist in the Carnegie Mellon University Robotics Institute. He is also an Adjunct Faculty in the CMU Heinz College where he teaches graduate courses on data mining and business intelligence. He also holds courtesy appointments in the CMU Machine Learning Department and in the Department of Biomedical Informatics at the University of Pittsburgh. He received a Ph.D. in Robotics and Artificial Intelligence from Polish Academy of Sciences and a M.Sc. in Aircraft Engineering from Warsaw University of Technology. Dr. Dubrawski leads fundamental and applied research projects funded by the National Science Foundation, Centers for Disease Control and Prevention, U.S. Department of Agriculture, Food and Drug Administration, U.S. Department of Defense, U.S. Department of Homeland Security and by industrial partners. His ongoing research interests are in achieving practical autonomy of intelligent systems.

Speaker: Adrià Gascón, LSI, UPC (joint work with Guillem Godoy)

Time: thursday dec. 11th, 11:00-12:00

Place: sala s2-208 del omega

Abstract:

First-order term unification is an essential concept in areas like functional and logic programming, automated deduction, deductive databases, artificial intelligence, information retrieval, compiler design, etc. It is known to be decidable in linear time.

However, many of the applications in the areas mentioned above deal with large amounts of data, and some kind of internal succinct representation for terms is required in order to guarantee computability in an environment with a limited amount of resources.

In recent years there has been an increase of interest in compression mechanisms based on grammar representation. Singleton Tree Grammars (STGs) allow to succinctly represent terms with exponential size and depth. Several STG-based compressors have been implemented for applications in data bases, and specially for tree data structures. Efficient algorithms have been developed for checking whether two STGs represent the same term, and for solving other related pattern matching problems.

We prove that the first order term unification problem is decidable in polynomial time, even if the input terms are given represented with a STG. Our algorithm generates the most general unifier also in polynomial time, and represents it again with a STG. We use several known results over STGs, like the fact that some operations as representing subterms of terms, deciding equality and generating the preorder traversal of a term are efficiently computable for STG. We also define a restricted notion of depth in order to show that the overall execution time is polynomially bounded.

Speaker: Josep Roure, Escola Universitària Politècnica de Mataró

Date: 05.11.2008

Time: 10:00 – 11:30

Place: sala s2-208 del omega

Abstract:

Detection of adverse events in multivariate time series is one of the key objectives of applied biomedical informatics. This paper shows how specific detectors can be learnt from data to outperform manually designed alternatives, even if the training data is labeled only sparsely. Learning detectors is useful in that it allows reducing development and maintenance costs of event detection systems. We present results obtained using real-world food safety data at a range of configurations from training sets where experts label negative examples only (data instances which do not correspond to any known event of interest), to training sets where experts provide labels of many known events. We experimentally show that powerful detectors can be learnt from few negative examples, and that the power of detectors can be improved if more labeled data is available. That is important in practice because labeled data is often difficult and costly to get.

Speaker: Jesse Read, University of Waikato, New Zealand

Date: 21.10.2008

Time: 10:00

Place: sala s2-208 del omega

Abstract:

In the multi-label classification task in machine learning, instances are assigned a subset of labels. These labels are chosen from a predefined set (or hierarchy) of labels. This is an extension of the common single-label classification problem (also known as multi-class classification) where only a single label from the label set is assigned to each instance. Multi-label classification is quite natural to many domains, such as: text classification, scene and video classification, medical diagnosis, and biological applications such as protein function classification and genomics. For example, a newspaper article could be assigned labels both "Science" and "Technology".

The extra dimension of multi-label classification offers interesting challenges. It is important to consider the relationships between labels, but also to limit the computational complexity and exaggerated imbalances which result from this extra dimension. A relatively new concept is multi-label classification in an on-line context. The on-line context is also natural to many domains, particularly text classification.The ultimate goal of on-line multi-label classification then, is to detect and incorporate label relationships efficiently, and be able to adapt to changes in these relationships over time.

Students are welcome at any time!

Seminars

Recommender Systems for Online Dating | Irena Koprinska, U. Sidney | 23.10.2012 |

Julia Sidorova, Alex Computing | 23.02.2010 | |

Jean-Christophe Janodet, U. St. Étienne | 22.02.2010 | |

A TAG formalism for Parsing and Translation | Xavier Carreras, LSI UPC | 09.02.2010 |

Sangre, sudor y máquinas | Alfredo Vellido, LSI, UPC | 1.12.2009 |

Tutorial on Conditional Random Fields for Sequence Prediction | Ariadna Quattoni, LSI, UPC | 24.11.2009 |

Process Mining: Discovering formal models from system logs | Josep Carmona, LSI, UPC | 05.11.2009 10:30 |

Algorithms for Finding Orders and Analyzing Sets of Chains | Antti Ukkonen, Helsinki Institute for Information Technology and Yahoo! Research | 21.05.2009 11:00 |

A Discriminative Instance Selection Method Using Random Trees | Stan Matwin, University of Ottawa | 07.05.2009 10:00 |

Sequential update of ADtrees | Josep Roure, Escola Universitària Politècnica de Mataró | 24.03.09 10:00 |

Hidden Markov Models : three basic problems | Marcel Alcoverro, TSC, UPC | 22.01.09 10:00 |

Data-driven analytics in support of biomedical security | Artur Dubrawski, Carnegie-Mellon University | 08.01.09 12:00 |

Unification with Singleton Tree Grammars | Adrià Gascón, LSI, UPC (joint work with Guillem Godoy) | 11.12.08 11:00 |

Learning Detectors of Events in Multivariate Time Series | Josep Roure, Escola Universitària Politècnica de Mataró | 05.11.2008 10:00 |

On-line Multi-label Classification | Jesse Read, University of Waikato, New Zealand | 21.10.2008 10:00 |

Recommender systems for online dating

Speaker: Irena Koprinska, University of SidneyTime: Oct 23rd, 2012

Place: A4014, building A4, campus nord

Abstract

Online dating websites are used by millions of people. This talk will explore current

research at the University of Sydney, Computer Human Adaptive Interaction (CHAI)

research lab, to build recommender systems for online dating, in collaboration with a

major Australian dating website. We will first show that similar people, as defined

by a set of personal attributes, like and dislike similar people and are liked and

disliked by similar people. This analysis provides the foundation for our

content-collaborative recommender approach. We will then present a study of the

implicit and explicit user preferences. The explicit preferences are stated by the

user while the implicit preferences are inferred based on the user behavior on the

website using a machine learning method. We will discuss the effectiveness of the

implicit and explicit user preferences in predicting the success of user interactions.

Bio:

Irena Koprinska is a Senior Lecturer at the University of Sydney, visiting LARCA as

part of her sabbatical. Irena's research interests are in Machine Learning, Data

Mining and Neural Networks, and their applications for Pattern Recognition and

Recommender Systems. For more information see Irena's web page:

http://rp-www.cs.usyd.edu.au/~irena/

Abstract: Online dating websites are used by millions of people. This talk will explore currentresearch at the University of Sydney, Computer Human Adaptive Interaction (CHAI)research lab, to build recommender systems for online dating, in collaboration with amajor Australian dating website. We will first show that similar people, as definedby a set of personal attributes, like and dislike similar people and are liked anddisliked by similar people. This analysis provides the foundation for ourcontent-collaborative recommender approach. We will then present a study of theimplicit and explicit user preferences. The explicit preferences are stated by theuser while the implicit preferences are inferred based on the user behavior on thewebsite using a machine learning method. We will discuss the effectiveness of theimplicit and explicit user preferences in predicting the success of user interactions.

Bio:

Irena Koprinska is a Senior Lecturer at the University of Sydney, visiting LARCA aspart of her sabbatical. Irena's research interests are in Machine Learning, DataMining and Neural Networks, and their applications for Pattern Recognition andRecommender Systems. For more information see Irena's web page:http://rp-www.cs.usyd.edu.au/~irena/

##### Emotion recognition and the TGI+.x classifier

Speaker: Julia A. Sidorova, Alex ComputingTime: Feb 23rd 2010, 10:00

Place: S208, floor -2, Omega Building.

Abstract: My application is recognition of emotions from speech. My contribution is a novel classification method.

I. the TGI+ classifier:

I adapted a philosophy of combining decision trees and tree automata, which originally comes from optical character recognition. The application to a new domain of speech emotions instead of optical character recognition requested devising new algorithms for the tree grammar inference part. The resulting classification algorithm incorporates statistical and syntactic learning. The syntactic part implements a tree grammar inference algorithm. The statistical part implements the entropy decision tree classifier C4.5.

I have further extended the idea with an built-in feature selection procedure. I tested the classifier on a benchmark data set of ACTED EMOTIONS. The proposed classifier outperformed a state of the art classifier, the multilayer perceptron, with a statistically significant difference in accuracies of 4.68% and the baseline of the C4.5, by 26.58%.

The main property of the TGI+ is human-readability of the classification process, which is of a potential application for example in the clinical context as an assessment and training tool for patients with impaired capabilities to express speech emotions.

II. further evolution to face big and noisy data sets:

I propose the high-level features, which are the distances from a feature vector to a tree automaton accepting class i, for all i in the set of the class labels. I propose to early-fuse the set of the low-level features and the set of the high-level features, submit the resulting set to a feature selection procedure and then do the classification step with the RIPPER-k classifier, which is, like the C4.5, a decision tree classifier yet more robust on big and noisy data sets. The proposed classification scheme outperformed the state of the art top performer (the SVM) with a statistically significant difference in accuracies of 7% on AUTHENTIC EMOTIONS.

Acted and authentic speech emotions are both important for practical applications, yet have been shown to be very different pattern recognition tasks.

##### Searching Patterns in Images using Discrete Representations

Speaker: Jean-Christophe Janodet, Université Jean MonnetTime: Feb 22nd 2010, 12:00

Place: Room S208, Omega Building

Summary: Most techniques for images classification are based on numerical vectors in high dimensional spaces. However, such representations may be difficult to extract in the case of small thumbnail images for instance. In this case, we claim that discrete representations, based on strings, trees or graphs are still challenging. In this talk, we model images with open plane graphs. We show that isomorphism and subisomorphism problems, known to be intractable in general, can be solved in quadratic time on this kind of graphs. Moreover, we use our algorithms to retrieve patterns within databases containing thousands of images. Finally, we shall give some hints concerning the development of new distances, based on the largest common subgraph, or on edit operations, that would allow one to classify

images using nearest-neighbour techniques.

Speaker: Xavier Carreras, LSI UPC

Time: feb 9th 2010, 15:30

Place: S208, floor -2, Omega Building.

Abstract:

Syntactic parsing is the fundamental problem of determining the structure of natural language sentences. It is a challenging task, because syntactic structures of natural languages are recursive, and there is a significant degree of ambiguity in determining how different parts of a sentence combine together syntactically.

In any computational model for parsing, the choice of grammar formalism is critical to both the representational power of the model and its computational efficiency. In this talk I will describe a variant of a Tree Adjoining Grammar (TAG) that can use a wide varitey of rich features and, at the same time, has efficient algorithms.

I will present two applications of our TAG. The first is a discriminative parser, a generalization of Conditional Random Fields for sequence prediction that extends the framework to syntactic parsing.

I will then present a system for Machine Translation that frames the task of translation as a parsing task.

No previous NLP knowledge required. This is joint work with Michael Collins and Terry Koo. Speaker: Alfredo Vellido, LSI

Time: dec 1st 2009, 11:05

Place: Room S208, Omega Building

Abstract:

De manera informal, hablaré sobre los proyectos de investigación en los que nuestro grupo, Soft Computing (SOCO), lleva involucrado desde 2006. En el área de biomedicina computacional, trabajamos en el desarrollo de métodos basados en técnicas de inteligencia artificial para ayudar a la toma de decisiones clínico-diagnósticas. No en cualquier área de medicina diagnóstica: en oncología. No en cualquier área de oncología: en neuro-oncología. El diagnóstico de tumores cerebrales es una tarea compleja y humanamente sensible. ¿Podemos ayudar a los expertos médicos mediante el uso de técnicas computacionales para el análisis de datos? Sólo si se dejan. ¿Es probable que se dejen? No. Y aunque se dejen, tengo mis dudas de que realmente podamos ser de gran ayuda. En cualquier caso, me agradaría que los asistentes a la charla me llevaran la contraria. Slides

Speaker: Ariadna Quattoni, LSI

Time: november 24th 2009, 15:30

Room: S208, Omega

Abstract:

Many applications in machine learning can be casted as sequence prediction problems, where given a sequence of input symbols the goal is to predict an output sequence that assigns a label to each input symbol. In natural language processing, for example, the problem of part of speech tagging involves predicting a sequence of part of speech tags for a given sentence (i.e. a sequence of words).

Conditional Random Fields (CRF) have been proposed to solve such sequence prediction problems. A CRF models the distribution of output sequences conditioned on an input sequence as a markov random field.

In this tutorial I will present the model and training algorithms. I will also frame CRFs as an instance of a larger class of factorized linear models for predicting structures. Finally, I will describe some extensions of the CRF model that allow for the incorporation of hidden variables. Slides (pdf).

Speaker: Josep Carmona, LSI, UPC

Date: November 5th, 2009

Time: 10:30

Place: Room S208, floor -2, Omega Building

Abstract:

Process mining is a relatively young area. The main goal is to incorporate the use of formal models and techniques in current systems to make them better. Among all branches, a main one is process discovery: from system traces, one wishes to find a formal model to represent them. This formal model may later be used for very different purposes (visualization, property verification, extension, bottleneck analysis, verification of conformity with the real system, etc.)

In this talk we will introduce modern techniques for formal model discovery, based on the theory of regions introduced in the nineties by Ehrenfeucht and Rozenberg. We will also enumerate some open problems and current research lines in our group.

No previous knowledge is required. Slides (pdf).

Speaker: Antti Ukkonen

Date: 21.05.2009

Time: 11:00-12:00

Place: sala s2-208 del omega

Abstract:

Rankings of items are a useful concept in a variety of applications, such as clickstream analysis, some voting methods, bioinformatics, and other fields of science such as paleontology. We address two problems related to such data. The first problem is about finding orders, while the second one is about analyzing sets of orders.

We consider two different tasks in the problem of finding orders. We can find orders either by computing an aggregate of a set of known orders, or by constructing an order for a previously unordered data set. For the first task we show that bucket orders, a subclass of partial orders, are a useful structure for summarizing sets of orders. We formulate an optimization problem for finding such partial orders, show that it is NP-hard, and give an efficient randomized algorithm for finding approximate solutions to it. Moreover, we show that the expected cost of a solution found by the randomized algorithm differs from the optimal solution only by a constant factor. For the second approach we propose a simple method for sampling orders for 0--1 vectors that is based on the consecutive ones property.

For analyzing orders, we discuss three different methods. First, we give an algorithm for clustering sets of orders. The algorithm is a variant of Lloyd's iteration for solving the k-means problem. We also give two different approaches for mapping orders to vectors in a high-dimensional Euclidean space. These mappings are used on one hand for clustering, and on the other hand for creating two dimensional visualizations (scatterplots) for sets of orders. Finally, we discuss randomization testing in case of orders. To this end we propose an MCMC algorithm for creating random sets of orders that preserve certain well defined properties of a given set of orders. The random data sets can be used to assess the statistical significance of the results obtained e.g. by clustering.

Dr. Antti UKKONEN received his Master's degree in 2004 from Helsinki University of Technology, and his doctoral degree from the same university in 2008. He has been working at Helsinki University of Technology and Helsinki Institute for Information Technology (HIIT) since 2004. His research interests are algorithmic data analysis and information retrieval. Prior to joining HIIT he has worked as a software engineer at Nokia and as a research assistant at Helsinki Institute of Physics. In 2003 he was visiting the European Organization for Nuclear Research (CERN). Currently he is on leave of absence from HIIT to visit Yahoo! Research/Fundacio Barcelona Media/ UPF in Barcelona.

Speaker: Stan Matwin, University of Ottawa

Date: 07.05.2009

Time: 10:00-11:00

Place: sala s2-208 del omega

Abstract:

With the increased sizes of datasets to be mined, it becomes critical to select and keep a smaller but representative instance subset for efficient and high quality data mining. Although diverse techniques have been proposed, an instance selection method that is model-independent and scalable is essential. In this paper, we show how to select instances using a set of random trees, using a search heuristic to minimize the number of selected instances. Our method has linear time complexity with the size of dataset. Empirical studies show that the size of datasets can be significantly reduced, and the reduced data can be used to train various classifiers without significantly sacrificing their prediction power. Finally, the comparisons with other instance selection methods show that the proposed method roughly matches the quality of the data selected by a state-of-the-art active learning method, but works much more efficiently.

Stan Matwin is a Professor of Computer Science at the University of Ottawa,and a Foreign Professor at the Institute of Computer Science, Polish Academy of Sciences. Former President of the Canadian AI Society (CSCSI) and the IPF WG on Machine Learning. His research is in machine learning, data Mining, and their applications. Actively involved in technology transfer between academic labs and industry, he has been received the Ontario Champion of Innovation Award.

Speaker: Josep Roure, Escola Universitària Politècnica de Mataró

Date: 24.03.2009

Time: 10:00-11:00

Place: sala s2-208 del omega

Abstract:

Increasingly, data-mining algorithms must deal with databases that continuously grow over time. These algorithms must avoid repeatedly scanning their databases. When database attributes are symbolic, ADtrees have already shown to be efficient structures to store sufficient statistics in main memory and to accelerate the mining process in batch environments. In this talk we present an efficient method to sequentially update ADtrees that is suitable for incremental environments.

Speaker: Marcel Alcoverro, TSC, UPC

Date: Thursday, jan 22

Time: 10:00-11:00

Place: sala s2-208 del omega

Abstract:

A Hidden Markov Model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters; the challenge is to determine the hidden parameters from the observable data. There are three canonical problems associated with HMM: we present them and the canonical algorithms for their solution.

* Given the parameters of the model, compute the probability of a particular output sequence, and the probabilities of the hidden state values given that output sequence. This problem is solved by the forward-backward algorithm.

* Given the parameters of the model, find the most likely sequence of hidden states that could have generated a given output sequence. This problem is solved by the Viterbi algorithm.

* Given an output sequence or a set of such sequences, find the most likely set of state transition and output probabilities. This problem is solved by the Baum-Welch algorithm.

Speaker: Artur Dubrawski, Carnegie-Mellon University

Time: thursday, jan 8th, 12:00-13:00

Place: sala s2-208, Omega building

Abstract:

Biomedical security is a fundamental requirement for the well being of human communities around the world. Challenges related to maintaining it have recently motivated numerous research activities aimed at supporting detection of disease outbreaks in epidemiology, at identifying new patterns of antibiotic resistance of bacteria in microbiology, or at investigating virulence and pathways of development of food-borne illnesses in food safety, to mention just a few. Some of the challenges can be at least partially addressed via analysis of the available data. This talk reviews research into and applications of the data-driven analytic methodologies designed to support maintenance of biomedical security, currently performed at the Carnegie Mellon University Auton Lab.

Bio:

Dr. Artur Dubrawski is the Director of the Auton Lab and Systems Scientist in the Carnegie Mellon University Robotics Institute. He is also an Adjunct Faculty in the CMU Heinz College where he teaches graduate courses on data mining and business intelligence. He also holds courtesy appointments in the CMU Machine Learning Department and in the Department of Biomedical Informatics at the University of Pittsburgh. He received a Ph.D. in Robotics and Artificial Intelligence from Polish Academy of Sciences and a M.Sc. in Aircraft Engineering from Warsaw University of Technology. Dr. Dubrawski leads fundamental and applied research projects funded by the National Science Foundation, Centers for Disease Control and Prevention, U.S. Department of Agriculture, Food and Drug Administration, U.S. Department of Defense, U.S. Department of Homeland Security and by industrial partners. His ongoing research interests are in achieving practical autonomy of intelligent systems.

Speaker: Adrià Gascón, LSI, UPC (joint work with Guillem Godoy)

Time: thursday dec. 11th, 11:00-12:00

Place: sala s2-208 del omega

Abstract:

First-order term unification is an essential concept in areas like functional and logic programming, automated deduction, deductive databases, artificial intelligence, information retrieval, compiler design, etc. It is known to be decidable in linear time.

However, many of the applications in the areas mentioned above deal with large amounts of data, and some kind of internal succinct representation for terms is required in order to guarantee computability in an environment with a limited amount of resources.

In recent years there has been an increase of interest in compression mechanisms based on grammar representation. Singleton Tree Grammars (STGs) allow to succinctly represent terms with exponential size and depth. Several STG-based compressors have been implemented for applications in data bases, and specially for tree data structures. Efficient algorithms have been developed for checking whether two STGs represent the same term, and for solving other related pattern matching problems.

We prove that the first order term unification problem is decidable in polynomial time, even if the input terms are given represented with a STG. Our algorithm generates the most general unifier also in polynomial time, and represents it again with a STG. We use several known results over STGs, like the fact that some operations as representing subterms of terms, deciding equality and generating the preorder traversal of a term are efficiently computable for STG. We also define a restricted notion of depth in order to show that the overall execution time is polynomially bounded.

Speaker: Josep Roure, Escola Universitària Politècnica de Mataró

Date: 05.11.2008

Time: 10:00 – 11:30

Place: sala s2-208 del omega

Abstract:

Detection of adverse events in multivariate time series is one of the key objectives of applied biomedical informatics. This paper shows how specific detectors can be learnt from data to outperform manually designed alternatives, even if the training data is labeled only sparsely. Learning detectors is useful in that it allows reducing development and maintenance costs of event detection systems. We present results obtained using real-world food safety data at a range of configurations from training sets where experts label negative examples only (data instances which do not correspond to any known event of interest), to training sets where experts provide labels of many known events. We experimentally show that powerful detectors can be learnt from few negative examples, and that the power of detectors can be improved if more labeled data is available. That is important in practice because labeled data is often difficult and costly to get.

Speaker: Jesse Read, University of Waikato, New Zealand

Date: 21.10.2008

Time: 10:00

Place: sala s2-208 del omega

Abstract:

In the multi-label classification task in machine learning, instances are assigned a subset of labels. These labels are chosen from a predefined set (or hierarchy) of labels. This is an extension of the common single-label classification problem (also known as multi-class classification) where only a single label from the label set is assigned to each instance. Multi-label classification is quite natural to many domains, such as: text classification, scene and video classification, medical diagnosis, and biological applications such as protein function classification and genomics. For example, a newspaper article could be assigned labels both "Science" and "Technology".

The extra dimension of multi-label classification offers interesting challenges. It is important to consider the relationships between labels, but also to limit the computational complexity and exaggerated imbalances which result from this extra dimension. A relatively new concept is multi-label classification in an on-line context. The on-line context is also natural to many domains, particularly text classification.The ultimate goal of on-line multi-label classification then, is to detect and incorporate label relationships efficiently, and be able to adapt to changes in these relationships over time.