Associate Professor in Computer Science at INSA Lyon | PhD (2011) - HDR (2020)
Currently on sabbatical from INSA Lyon, working as R&I Manager at Infologic since February 2018.
Google Scholar • DBLP • ORCID • HAL • ResearchGate • Publons • LinkedIn • X (Twitter)
I have been an associate professor of Computer Science at the Institut National des Sciences Appliquées de Lyon (INSA Lyon, France) since 2012. My main research interests are artificial intelligence, data mining, and especially pattern mining with formal concept analysis. Original methods and algorithms developed as part of my research are applied to various domains, with particular attention to software engineering, digitalization, olfaction in neuroscience, geo-located social media analysis, electronic sports, and video game analytics. I bring extensive industry experience, having participated in several research projects and collaborations, including CIFRE, FUI, and EU FP7 Marie Curie programmes. Additionally, I have a year of experience as an auto-entrepreneur, eight months in a startup focusing on knowledge transfer, and, most notably, seven years working for a French SME software vendor. During this time, I have developed a robust skill set and gained valuable expertise.
My scientific adventure began with the study of a binary relationship, very often illustrated by grocery store transaction data, linking customers and the products they buy. How can we make this relationship speak? What knowledge, behavioral habits, recommendations, etc. can we characterize? This initial question allowed me to travel through different application fields (biology, neuroscience, social networks and video game analytics, software engineering), seeking to implement or adapt data mining methods to try to understand some phenomena while formalizing data and patterns in the most rigorous way. My research accordingly follows three main axes: the formalism framing the methods (Formal Concept Analysis), the methodological and algorithmic aspects of data mining, and finally Knowledge Discovery "in practice" through several concrete applications encountered during collaborations with other scientists or industrial partners.
The process of collecting and analyzing data to answer predictive, explanatory, and decision-making questions has been known as "data science" for more than thirty years. First used only by scientists, mainly statisticians, the term is now widespread in both the academic and industrial worlds. This can be explained in two ways: (i) data is ubiquitous, large, and varied, and (ii) there has been a growing awareness of the far-reaching potential of data. That potential can be economic, societal, scientific, or related to health care, and rests not only on the data an entity already has, but also on the data it can acquire (sensors, social networks, open data, etc., freely or not), making data a black gold that still needs algorithms, methods, and methodologies to be properly refined. One component of data science, Knowledge Discovery in Databases (KDD), deals in particular with the Data-Information-Knowledge process, with the aim of explaining relationships or discovering hidden properties. In contrast to purely statistical approaches, one family of methods has met with great success over the last twenty years: data mining, and especially pattern mining. Its goal is to describe, summarize, and raise hypotheses from data. In particular, pattern mining makes it possible to efficiently find regularities of various types (such as frequent patterns in a set of transactions, molecular sub-graphs characteristic of toxicity, locally co-expressed gene groups, etc.). Indeed, where conventional approaches aim to validate or invalidate a hypothesis given a priori, the search for patterns can be seen as a technique for enumerating all the possible hypotheses (a set of exponential size w.r.t. the input data) that satisfy given constraints or maximize a certain interest for the expert. Once discovered, the best hypotheses can then be tested, validated or invalidated, and ultimately retained as knowledge units.
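As a concrete illustration of pattern mining, here is a minimal, naive sketch of frequent itemset enumeration on toy market-basket data (the transactions and the `frequent_itemsets` helper are purely illustrative, not taken from my publications):

```python
from itertools import combinations

# Toy market-basket transactions (illustrative data)
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def frequent_itemsets(transactions, min_support):
    """Naively enumerate every itemset whose support meets min_support."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            # Support = number of transactions containing the candidate
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                frequent[candidate] = support
    return frequent

print(frequent_itemsets(transactions, min_support=2))
```

Real pattern miners avoid this exponential scan by exploiting the anti-monotonicity of support; the sketch only shows what is being enumerated.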
My scientific adventure began with the study of a binary relationship, very often illustrated by grocery store transaction data, linking customers and the products they buy. How can we make this relationship speak? What knowledge, behavioral habits, recommendations, etc. can we characterize? This initial question allowed me to travel through different application fields (biology, neuroscience, social networks and video game analytics), seeking to implement or adapt data mining methods to try to understand some phenomena while formalizing data and patterns in the most rigorous way. This is the story of this manuscript, organized along three main research axes: the formalism framing the methods (Formal Concept Analysis), the methodological and algorithmic aspects of data mining, and finally Knowledge Discovery "in practice" through several concrete applications encountered during collaborations with other scientists or industrial partners.
Supervisors: Amedeo Napoli, Sébastien Duplessis
The main topic of this thesis addresses the important problem of mining numerical data, especially gene expression data. These data characterize the behaviour of thousands of genes in various biological situations (time, cell, etc.). A difficult task consists in clustering genes to obtain classes of genes with similar behaviour, supposed to be involved together in a biological process. Accordingly, we are interested in designing and comparing methods in the field of knowledge discovery from biological data. We study how the conceptual classification method called Formal Concept Analysis (FCA) can handle the problem of extracting interesting classes of genes. For this purpose, we have designed and experimented with several original methods based on an extension of FCA called pattern structures. Furthermore, we show that these methods can enhance decision making in agronomy and crop health within the broad formal domain of information fusion.
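To give a flavour of FCA, here is a minimal sketch of the two derivation operators on a toy gene × situation binary context (the context and helper names are illustrative assumptions; real gene expression data is numerical and is handled via pattern structures, not a plain binary context):

```python
# Toy gene x situation binary context (illustrative data)
context = {
    "g1": {"s1", "s2"},
    "g2": {"s1", "s2", "s3"},
    "g3": {"s2", "s3"},
}

def common_attributes(genes):
    """Attributes shared by all given objects (one derivation operator)."""
    sets = [context[g] for g in genes]
    return set.intersection(*sets) if sets else set.union(*context.values())

def objects_having(attrs):
    """Objects possessing all given attributes (the dual operator)."""
    return {g for g, a in context.items() if attrs <= a}

# A formal concept is a pair (extent, intent) that are fixed points
# of each other's derivation.
extent = objects_having({"s2"})
intent = common_attributes(extent)
print(sorted(extent), sorted(intent))
```

Composing the two operators gives a closure, and the fixed points (formal concepts) ordered by extent inclusion form the concept lattice that these methods explore.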
Supervisors: Jean-François Boulicaut, Mehdi Kaytoue
Effective supervision of modern IT systems presents new challenges in terms of scalability, reliability, and efficiency. Traditional operations and maintenance systems that rely on manual tasks and individual troubleshooting are inefficient. Rule-based inference engines, although useful for detecting anomalies and automating resolution, are limited in handling the large number of alerts generated by IT systems. Artificial Intelligence for IT Operations (AIOps) proposes the use of advanced analytics and machine learning to improve and automate supervision systems. However, there are several challenges in this field. Firstly, the lack of unified terminology makes it difficult to compare contributions from different disciplines, and the requirements and metrics for constructing effective AIOps models are not well defined. Secondly, AIOps has primarily focused on predictive models for anomaly detection and failure prediction, neglecting descriptive models that can handle data quality and complexity concerns. Thirdly, the reliance on opaque black-box models limits their adoption by industry practitioners, who need a clear understanding of the decision-making process of maintenance models. Lastly, existing AIOps solutions often overlook performance evaluation and scalability issues when developing and evaluating incident management models. As part of this Ph.D. thesis, we propose several contributions to tackle these challenges. Firstly, we offer a systematic approach to AIOps that organizes the extensive knowledge surrounding it: by categorizing data-driven approaches from various research areas and disciplines according to industry standards and requirements, we provide a cohesive framework. Secondly, we explore the application of Subgroup Discovery and its generalization, Exceptional Model Mining, a promising data mining technique, in the context of AIOps.
This well-defined framework allows for the extraction of valuable hypotheses from large and diverse datasets. It enables users to understand, interact with, and interpret the underlying processes behind predictive models. Our contributions in this area include a practical application focused on identifying suspicious query fragments in large SQL workloads to pinpoint performance degradation issues. Additionally, we develop an interpretation mechanism for incident triage models, providing contextualized explanations for the models' decisions. Furthermore, we address the challenging problem of Java memory analysis using huge and complex datasets that incorporate hierarchical data. Lastly, we address scalability by studying incident deduplication, a well-known problem in the industry. Our goal is to efficiently retrieve the most similar crash reports by combining locality-sensitive hashing and learning-to-hash techniques within a unified framework.
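As a rough illustration of the locality-sensitive-hashing side of this deduplication idea, here is a minimal MinHash/LSH sketch over toy crash reports represented as sets of stack-frame tokens (the hash choice, banding parameters, and data are illustrative assumptions, not the thesis implementation):

```python
import hashlib

def minhash_signature(tokens, num_hashes=8):
    """MinHash signature: for each seeded hash, the minimum over tokens."""
    return tuple(
        min(int(hashlib.md5(f"{i}:{t}".encode()).hexdigest(), 16)
            for t in tokens)
        for i in range(num_hashes)
    )

def lsh_buckets(reports, bands=4, rows=2):
    """Group reports whose signatures collide in at least one band."""
    buckets = {}
    for rid, tokens in reports.items():
        sig = minhash_signature(tokens, bands * rows)
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets.setdefault(key, set()).add(rid)
    # Keep only buckets that actually propose candidate duplicates
    return {k: v for k, v in buckets.items() if len(v) > 1}

# Toy crash reports as sets of stack-frame tokens (illustrative)
reports = {
    "crash1": {"main", "parse", "alloc"},
    "crash2": {"main", "parse", "alloc"},
    "crash3": {"main", "render", "draw"},
}
print(lsh_buckets(reports))
```

The point of the banding scheme is that highly similar reports collide in some band with high probability, so candidate duplicates are retrieved without comparing every pair.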
Supervisors: Jean-François Boulicaut, Mehdi Kaytoue
It is extremely useful to exploit labeled datasets not only to learn models and perform predictive analytics, but also to improve our understanding of a domain and its available target classes. The subgroup discovery task has been studied for more than two decades. It concerns the discovery of rules covering sets of objects having interesting properties, e.g., that characterize a given target class. Though many subgroup discovery algorithms have been proposed for both transactional and numerical data, discovering rules within labeled sequential data has been much less studied. In that context, exhaustive exploration strategies cannot be used for real-life applications and we have to look for heuristic approaches. In this thesis, we propose to apply bandit models and Monte Carlo Tree Search to explore the search space of possible rules using an exploration-exploitation trade-off, on different data types such as sequences of itemsets or time series. For a given budget, they find a collection of the top-k best rules in the search space w.r.t. a chosen quality measure. They require a light configuration and are independent of the quality measure used for pattern scoring. To the best of our knowledge, this is the first time that the Monte Carlo Tree Search framework has been exploited in a sequential data mining setting. We have conducted thorough and comprehensive evaluations of our algorithms on several datasets to illustrate their added value, and we discuss their qualitative and quantitative results. To assess the added value of one of our algorithms, we propose a use case in game analytics, more precisely Rocket League match analysis. Discovering interesting rules in sequences of actions performed by players, and using them in a supervised classification model, shows the efficiency and the relevance of our approach in the difficult and realistic context of high-dimensional data.
It supports the automatic discovery of skills and can be used, for example, to create new game modes, improve the ranking system, help e-sport commentators, or better analyse opposing teams.
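The bandit-style exploration-exploitation trade-off underlying this work can be illustrated with a tiny UCB1 simulation, where each arm stands for a candidate rule whose hidden quality can only be sampled (arm names, reward probabilities, the budget, and the constant `c` are all illustrative; this is not the thesis algorithm):

```python
import math
import random

def ucb1(stats, total, c=1.4):
    """Select the arm maximizing mean reward plus an exploration bonus."""
    best, best_score = None, float("-inf")
    for arm, (visits, reward) in stats.items():
        # Unvisited arms get infinite score, so each is tried at least once
        score = float("inf") if visits == 0 else (
            reward / visits + c * math.sqrt(math.log(total) / visits))
        if score > best_score:
            best, best_score = arm, score
    return best

# Toy bandit: each arm is a candidate rule with a hidden quality,
# observed only through noisy samples (illustrative values).
random.seed(0)
quality = {"rule_a": 0.8, "rule_b": 0.3, "rule_c": 0.5}
stats = {arm: [0, 0.0] for arm in quality}  # arm -> [visits, total reward]

for t in range(1, 501):
    arm = ucb1(stats, t)
    stats[arm][0] += 1
    stats[arm][1] += 1.0 if random.random() < quality[arm] else 0.0

# The highest-quality arm should end up with the most visits
print(max(stats, key=lambda a: stats[a][0]))
```

MCTS generalizes this selection rule from a flat set of arms to a tree of successive rule refinements, which is what makes it suitable for structured pattern search spaces.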
Supervisors: Mehdi Kaytoue, Céline Robardet
As the title of this dissertation may suggest, the aim of this thesis is to provide an order-theoretic point of view on the task of subgroup discovery. Subgroup discovery is the automated task of discovering interesting hypotheses in databases. That is, given a database, the hypothesis space the analyst wants to explore, and a formal way for the analyst to gauge the quality of hypotheses (e.g. a quality measure), subgroup discovery aims to extract the interesting hypotheses w.r.t. these parameters. In order to design fast and efficient algorithms for subgroup discovery, one should understand the underlying properties of the hypothesis space on the one hand and the properties of its quality measure on the other. In this thesis, we extend the state of the art by: (i) providing a unified view of the hypothesis space behind subgroup discovery using the well-founded mathematical tool of order theory, (ii) proposing the new hypothesis space of conjunctions of linear inequalities in numerical databases, together with algorithms enumerating its elements, and (iii) proposing an anytime algorithm for discriminative subgroup discovery on numerical datasets that provides guarantees upon interruption.
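Contribution (ii) can be made concrete with a small sketch: a hypothesis is a conjunction of linear inequalities over numerical attributes, and its extent is the set of objects satisfying all of them (the dataset and inequalities below are illustrative, not from the thesis):

```python
# A hypothesis as a conjunction of linear inequalities: an object is
# covered iff it satisfies every inequality (illustrative sketch).

def covers(inequalities, row):
    """Each inequality is (weights, threshold), meaning sum(w*x) <= threshold."""
    return all(
        sum(w * x for w, x in zip(weights, row)) <= threshold
        for weights, threshold in inequalities
    )

dataset = [(1.0, 2.0), (3.0, 1.0), (0.5, 0.5)]
# Hypothesis: x1 + x2 <= 3  AND  x1 - x2 <= 1
hypothesis = [((1.0, 1.0), 3.0), ((1.0, -1.0), 1.0)]

extent = [row for row in dataset if covers(hypothesis, row)]
print(extent)
```

Geometrically, such a hypothesis is a convex polyhedron; ordering hypotheses by inclusion of their extents is what makes the order-theoretic treatment possible.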
Supervisors: Jean-François Boulicaut, Mehdi Kaytoue
The discovery of patterns that strongly distinguish one class label from another is still a challenging data mining task. Subgroup Discovery (SD) is a formal pattern mining framework that enables the construction of intelligible classifiers and, most importantly, the elicitation of interesting hypotheses from the data. However, SD still faces two major issues: (i) how to define appropriate quality measures to characterize the interestingness of a pattern; (ii) how to select an accurate heuristic search technique when exhaustive enumeration of the pattern space is infeasible. The first issue has been tackled by Exceptional Model Mining (EMM), which discovers patterns covering tuples that locally induce a model substantially different from the model of the whole dataset. The second issue has been studied in SD and EMM mainly through beam-search strategies and genetic algorithms for discovering a pattern set that is non-redundant, diverse, and of high quality. In this thesis, we argue that the greedy nature of most such previous approaches produces pattern sets that lack diversity. Consequently, we formally define pattern mining as a game and solve it with Monte Carlo Tree Search (MCTS), a recent technique mainly used for games and planning problems in artificial intelligence. Contrary to traditional sampling methods, MCTS leads to an anytime pattern mining approach without assumptions on either the quality measure or the data. It converges to an exhaustive search if given enough time and memory. The exploration/exploitation trade-off allows the diversity of the result set to be improved considerably compared to existing heuristics. We show that MCTS quickly finds a diverse pattern set of high quality in our application in neuroscience. We also propose and validate a new quality measure especially tuned for imbalanced multi-label data.
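A standard SD quality measure that instantiates point (i) is weighted relative accuracy (WRAcc): coverage times the lift of the target rate inside the subgroup. A minimal sketch on toy counts (the numbers are illustrative):

```python
def wracc(n_subgroup, n_subgroup_pos, n_total, n_total_pos):
    """Weighted relative accuracy of a subgroup w.r.t. a target class:
    coverage * (target rate in subgroup - target rate overall)."""
    coverage = n_subgroup / n_total
    return coverage * (n_subgroup_pos / n_subgroup - n_total_pos / n_total)

# Toy example: 100 objects, 30 positives; a subgroup of 20 with 15 positives.
# WRAcc = 0.2 * (0.75 - 0.30) = 0.09
print(wracc(20, 15, 100, 30))
```

WRAcc balances generality against discrimination: a subgroup whose positive rate matches the overall rate scores zero no matter how large it is, which is one reason it behaves poorly on heavily imbalanced data and motivates dedicated measures.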
Supervisors: Jean-François Boulicaut, Mehdi Kaytoue
In a manufacturing context, a product moves through different placements or sites before it reaches the final customer. Each of these sites has a different function, e.g. creation, storage, retailing, etc. In this scenario, traceability data richly describes the events a product undergoes in the whole supply chain (from factory to consumer) by recording temporal and spatial information as well as other important descriptive elements. Traceability is thus an important mechanism for discovering anomalies in a supply chain, like the diversion of computer equipment or counterfeits of luxury items. In this thesis, we propose a methodological framework for mining unitary traces using knowledge discovery methods. We show how the data mining process, applied to unitary traces encoded in specific data structures, allows extracting interesting patterns that characterize frequent behaviors. We demonstrate that domain knowledge, that is, the flow of products provided by experts and compiled in the industry model, is useful and efficient for classifying unitary traces as deviant or not. Moreover, we show how data mining techniques can be used to characterize abnormal behaviors (when and how did they occur?). We also propose an original method for detecting identity usurpation in the supply chain based on behavioral data, e.g. distributors using fake identities or concealing them. Finally, we detail the achievements of this thesis with the development of a trace analysis platform in the form of a prototype.
The Conservatoire national des arts et métiers offers rich training in various fields, enabling its students to re-enter the workforce, refresh their knowledge for a possible professional promotion, or simply satisfy their own curiosity. It prepares them for the CNAM engineering degree, a recognized diploma. For two years I taught the artificial intelligence course in the final cycle of the curriculum. This 50-hour-per-year course introduces the problems of artificial intelligence and different ways of formalizing and solving problems. Thanks to Amedeo Napoli for the high-quality materials he passed on to me.
Université Nancy 2, UFR de mathématiques et informatique.
An algorithm to mine (frequent) closed numerical patterns and their generators (convex hulls) https://github.com/mehdi-kaytoue/MinIntChange
A plugin for SC2Gears that generates input data for various sequence mining algorithms https://github.com/mehdi-kaytoue/Sc2Gears4DM
TriMax algorithm for computing maximal biclusters of similar values https://github.com/mehdi-kaytoue/trimax
An algorithm for Mining Contextual Exceptional Subgraphs https://github.com/mehdi-kaytoue/contextual-exceptional-subgraph-mining
Algorithms to mine balanced sequential patterns, jointly developed with Guillaume Bosc during his master's studies
Coron is a data-mining software suite for formal concept analysis and pattern mining created by Laszlo Szathmary. I contributed to its development and dissemination.
Feel free to reach out via LinkedIn