Luca Gagliardelli

thesis

2018/2019

Techniques for Big Data Integration in Distributed Computing Environments

Data sources providing huge amounts of semi-structured data are available on the Web as tables, annotated content (e.g., RDF) and Linked Open Data. These sources can be a valuable source of information for companies, researchers and government agencies, if properly manipulated and integrated with each other or with proprietary data. One of the main problems is that these sources are typically heterogeneous and do not come with keys that allow join operations to effortlessly link their records. Thus, finding a way to join data sources without keys is a fundamental and critical step of data integration. Moreover, for many applications the execution time is a critical factor (e.g., in finance or national security contexts), and distributed computing can be employed to significantly reduce it. In this dissertation, I present distributed data integration techniques that scale to large volumes of data (i.e., Big Data), in particular SparkER and GraphJoin. SparkER is an Entity Resolution tool that exploits distributed computing to identify records in data sources that refer to the same real-world entity, thus enabling the integration of the records. The tool introduces a novel algorithm to parallelize the state-of-the-art indexing techniques. SparkER is a working software prototype that I developed and employed to perform experiments over real data sets; the results show that the parallelization techniques I developed are more efficient in terms of execution time and memory usage than those in the literature. GraphJoin is a novel technique that finds similar records by applying joining rules on one or more attributes. It combines similarity join techniques originally designed for a single rule, optimizing their execution over multiple joining rules and supporting both token-based and character-based similarity measures (e.g., Jaccard similarity and edit distance). For GraphJoin I developed a working software prototype and employed it to experimentally demonstrate that the proposed technique is effective and outperforms existing ones in terms of execution time.
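As a rough illustration of the joining rules mentioned above, the following Python sketch combines a token-based measure (Jaccard similarity) and a character-based one (edit distance) over two attributes; the attribute names and thresholds are invented for the example, and this is not the actual GraphJoin or SparkER code.

```python
# Minimal sketch of a joining rule mixing token- and character-based measures.
# Record fields ("title", "venue") and thresholds are illustrative assumptions.

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # delete
                            curr[j - 1] + 1,              # insert
                            prev[j - 1] + (ca != cb)))    # substitute
        prev = curr
    return prev[-1]

def joining_rule(r1: dict, r2: dict) -> bool:
    """Example rule on two attributes: token similarity on 'title'
    AND character similarity on 'venue'."""
    return (jaccard(r1["title"], r2["title"]) >= 0.8
            and edit_distance(r1["venue"], r2["venue"]) <= 2)

if __name__ == "__main__":
    a = {"title": "big data integration techniques", "venue": "UNIMORE"}
    b = {"title": "techniques for big data integration", "venue": "UNIMORE"}
    print(joining_rule(a, b))  # True
```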

Luca Magnotta

thesis

2017/2018

Analysis and development of advanced data integration solutions for data analytics tools

This thesis presents my research and development activities on the MOMIS Dashboard, an interactive data analytics tool to explore and visualize the content of data sources through different types of dynamic views. The software is very versatile and supports connections to the main relational DBMSs and Big Data sources; for data access, the MOMIS Dashboard relies on MOMIS, an open source data integration system that can integrate heterogeneous data sources. The research activity focused on the development of new tools in MOMIS that enhance its ability to generate integrated schemas: the framework was integrated with NORMS, a tool for the standardization of schema labels, and with SparkER, a tool for Entity Resolution. Thanks to NORMS, MOMIS can find the semantic relationships existing between sources whose schema labels (i.e., the names of classes or attributes of a schema) contain acronyms, abbreviations and compound terms. SparkER, on the other hand, is an Entity Resolution tool created by the DBGroup laboratory of the University of Modena and Reggio Emilia (Italy); it employs advanced Meta-Blocking techniques and thus outperforms other Entity Resolution tools based on Hadoop MapReduce. Within MOMIS, SparkER enables schema matching based on the content of the data sources rather than on the schema labels, thus determining semantic relationships that would otherwise be difficult to identify even for domain experts. Finally, this thesis shows how MOMIS was used as the data integration engine of the MOMIS Dashboard, a data analytics tool that has been applied both in industrial contexts, within the Italian Industry 4.0 plan, and in the medical-scientific domain.

Giuseppe Fiameni

thesis

2017/2018

A distributed HPC infrastructure to process very large scientific data sets

The goal of this thesis work is to develop a distributed HPC infrastructure to support the processing of very large scientific data sets by federating different compute and data resources across Europe. A set of common technical specifications has been derived to provide a high-level specification of the overall architecture and to give details of the key architectural elements that are essential for realizing the infrastructure, including scientific cases, emerging technologies and new processing methodologies. The work has been mainly fueled by the need to provide a scalable solution, to handle new memory technologies such as those based on non-volatile chips, to provide easy access to data, and to improve the user experience by fostering the convergence between the traditional High Performance Computing and Cloud Computing utilization models. Nowadays the main access model for large-scale HPC systems is based on the scheduling of batch jobs. This approach does not stem from a requirement of the computational science community; rather, it reflects the predominant issue in the management of HPC systems: the maximization of resource utilisation. Conversely, the situation differs for personal workstations or shared-memory servers, where time-sharing interactive executions are the norm. Our design proposes a new paradigm, called "Interactive Computing", which refers to the capability of a system to support massive computing workloads while permitting on-the-fly interruption by the user. The real-time interaction of a user with a program runtime is motivated by various factors, such as the need to estimate the state of a program or its future tendency, to access intermediate results, and to steer the computation by modifying input parameters or boundary conditions. Within the neuroscience community, one of the scientific cases taken into consideration in this work, the most used applications (e.g., brain activity simulation, large image volume rendering and visualization, connectomics experiments) require that the runtime can be modified interactively, so that the user can gain insight into parameters, algorithmic behaviour, and optimization potential. The commonly agreed central components of interactive computing are, on the front end, a sophisticated user interface to interact with the program runtime and, on the back end, a separate, steerable, often CPU- and memory-consuming application running on an HPC system. A typical usage scenario for interactive computing concerns the visualization, processing, and reduction of large amounts of data, especially where the processing cannot be standardized or implemented in a static workflow. The data can be generated by simulation or harvested from experiments or observations. In both cases, during the analysis the scientist performs an interactive process of successive reductions and productions of data views that may include complex processing such as convolution, filtering, clustering, etc. This kind of processing could easily be parallelized to take advantage of HPC resources, but it would clearly be counterproductive to break down a user session into separate interactive steps interspersed with batch jobs, as their scheduling would delay the entire execution and degrade the user experience. Besides that, in many application fields computational scientists are starting to use interactive frameworks and scripting languages, such as R, Stata, Matlab/Octave or Jupyter Notebooks, to integrate the more traditional compute and data processing applications running in batch. The work has been supported by the Human Brain Project (www.humanbrainproject.eu) and Cineca (www.hpc.cineca.it), the largest supercomputing centre in Italy.

Song Zhu

thesis

2017/2018

Scalable Join Methods and their applications in Data Integration Systems

Every second we produce a large amount of data; as a consequence, the ability to transform these data into useful information is crucial for efficiently managing the society we live in. Such data can be stored in different and heterogeneous systems; therefore, integration is an important task for viewing and processing them, and existing Data Integration techniques have to be upgraded to support large and complex data. In this context, this dissertation has the goal of studying and improving the performance of critical operations in Data Integration. The main topic of this work is the join operator in Big Data Integration, where it plays a key role. Two types of join are most used in this area. The first is the equi-join, employed in the Merge Join step to merge two or more data sources: the join used in this context has an equality predicate and is usually an outer join, since data present in one data source may not be present in the others. Another issue is the number of data sources: when many data sources share common attributes, using only binary joins can be inefficient. In this perspective, this dissertation proposes a new join algorithm, SOPJ, created specifically to make the Merge Join step more efficient, parallelizable and scalable; these features allow it to efficiently handle not only large data sources but also a huge number of them. The second type of join is the similarity join, an operator used for many purposes in Data Integration, especially in data cleansing and normalization operations such as duplicate detection and entity resolution. Similarity join has been widely studied in the literature, and with the MapReduce paradigm, making this operation efficient and scalable has become a hot topic. In this dissertation we studied one of the most famous similarity join algorithms, PPJoin; we implemented it on Apache Spark and introduced improvements to make it more efficient. The experimental results show the effectiveness of the proposed solutions. Finally, we present an alternative to the similarity join for the entity resolution operation, called meta-blocking; our contribution is to implement this method on Apache Spark to make meta-blocking scalable and usable for large data volumes. The goal of this work is to study and improve the scalability and efficiency of operations in Data Integration systems, like MOMIS, in order to be able to manage the huge amount of available data.
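To make the Merge Join step concrete, here is a minimal PySpark sketch of the naive approach it improves upon: several sources are merged on a shared key with full outer joins, so records missing from some sources are preserved. Source contents and the key name are assumptions for illustration; the SOPJ algorithm itself is not shown.

```python
# Minimal sketch of merging sources with full outer joins (not the SOPJ code).
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-join-sketch").getOrCreate()

# Three toy sources sharing the key "id" but with different attributes.
s1 = spark.createDataFrame([(1, "ACME Spa"), (2, "Foo Srl")], ["id", "name"])
s2 = spark.createDataFrame([(1, "Modena"), (3, "Bologna")], ["id", "city"])
s3 = spark.createDataFrame([(2, "manufacturing")], ["id", "sector"])

# Chain binary full outer joins over the list of sources; with many sources
# this naive plan repeats the shuffle on a growing intermediate result.
merged = reduce(lambda left, right: left.join(right, on="id", how="full_outer"),
                [s1, s2, s3])
merged.orderBy("id").show()
```

A chain of binary joins like this grows the intermediate result and repeats the shuffle at every step, which illustrates why binary joins alone can become inefficient when the number of sources is large, as noted above.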

Giovanni Simonini

thesis

2015/2016

Loosely Schema-aware Techniques for Big Data Integration

A huge amount of semi-structured data is available on the Web in the form of web tables, marked-up content (e.g., RDFa, Microdata), and Linked Open Data. For enterprises, government agencies, and researchers of large scientific projects, this data can be even more valuable if integrated with the data that they already own, which is typically the subject of traditional Data Integration processes. Being able to identify records that refer to the same entity is a fundamental step to make sense of this data. Generally, to perform Entity Resolution (ER), traditional techniques require a schema alignment between data sources. Unfortunately, the semi-structured data of the Web is usually characterized by high heterogeneity, high levels of noise (missing/inconsistent data), and very large volume, making traditional schema alignment techniques no longer applicable. Therefore, techniques that deal with this kind of data typically renounce exploiting schema information and rely on redundancy to limit the chance of missing matches. This dissertation tackles two fundamental problems related to ER in the context of highly heterogeneous, noisy and voluminous data: (i) how to extract schema information useful for ER from the data sources, without performing a traditional schema alignment; (ii) how this information can be fully exploited to reduce the complexity of ER, in particular to support indexing techniques that aim to group similar records in blocks and limit the comparisons to only those records appearing in the same block. We address those open issues by introducing: a set of novel methodologies to induce loose schema information directly from the data, without exploiting the semantics of the schemas; and BLAST (Blocking with Loosely Aware Schema Techniques), a novel unsupervised blocking approach able to exploit that information to produce high-quality block collections. We experimentally demonstrate, on real-world datasets, how BLAST can outperform the state-of-the-art blocking approaches and, in many cases, also the supervised ones.
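As a point of reference for the redundancy-based indexing mentioned above, the following Python sketch shows plain schema-agnostic token blocking, the simple baseline that loose schema-aware approaches such as BLAST improve on; the records and attribute names are invented for the example.

```python
# Minimal token-blocking sketch (a simplified baseline, not BLAST itself):
# records sharing at least one token end up in the same block, and only
# records within the same block are compared.
from collections import defaultdict
from itertools import combinations

records = {
    "r1": {"title": "iPhone 7 32GB black"},
    "r2": {"name":  "Apple iPhone 7 (32GB)"},
    "r3": {"title": "Galaxy S7 64GB"},
}

blocks = defaultdict(set)            # token -> block (set of record ids)
for rid, rec in records.items():
    for value in rec.values():       # schema-agnostic: every attribute is used
        for token in value.lower().replace("(", " ").replace(")", " ").split():
            blocks[token].add(rid)

# Candidate pairs = pairs co-occurring in at least one block.
candidates = {pair for block in blocks.values() if len(block) > 1
              for pair in combinations(sorted(block), 2)}
print(sorted(candidates))   # [('r1', 'r2')]
```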

Fabio Benedetti

thesis

2015/2016

Revealing the underlying structure of Linked Open Data for enabling visual querying

The Linked Data Principles ratified by Tim Berners-Lee promise that a large portion of Web data will be usable as one big interlinked RDF (Resource Description Framework) database. Today, with more than one thousand Linked Open Data (LOD) sources available on the Web, we are witnessing an emerging trend in the publication and consumption of LOD datasets. However, the pervasive use of external resources, together with a deficiency in the definition of the internal structure of a dataset, makes many LOD sources extremely complex to understand. The goal of this thesis is to propose tools and techniques able to reveal the underlying structure of a generic LOD dataset in order to promote the consumption of this new format of data. In particular, I propose an approach for the automatic extraction of statistical and structural information from a LOD source and the creation of a set of indexes (i.e., Statistical Indexes) that enhance the description of the dataset. By using this structural information, I defined two models able to effectively describe the structure of a generic RDF dataset: the Schema Summary and the Clustered Schema Summary. The Schema Summary contains all the main classes and properties used within the dataset, whether they are taken from external vocabularies or not. The Clustered Schema Summary, suitable for large LOD datasets, provides a higher-level view of the classes and properties used, by gathering together classes that are the object of multiple instantiations. All these efforts allowed the development of a tool called LODeX, able to provide a high-level summarization of a LOD dataset and a powerful visual query interface to support users in querying/analyzing an unknown dataset. All the techniques proposed in this thesis have been extensively evaluated and compared with the state of the art in their field: a performance evaluation of the LODeX module devoted to the extraction of the indexes is proposed; the schema summarization technique has been evaluated according to ontology summarization metrics; finally, LODeX itself has been evaluated by inspecting its portability and usability. In the second part of the thesis, I present a novel technique called CSA (Context Semantic Analysis) that exploits the information contained in a knowledge graph to estimate the similarity between documents. This technique has been compared with other state-of-the-art measures using a benchmark containing documents and similarity judgments provided by human judges.
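As an idea of the statistical information such a summarization relies on, the following Python sketch (an illustration of mine, not the LODeX implementation) counts the instances of each class in a small RDF dataset with rdflib and SPARQL.

```python
# Minimal sketch: per-class instance counts, one ingredient of a
# "statistical index" used to build a schema summary of an RDF dataset.
from rdflib import Graph

ttl = """
@prefix ex: <http://example.org/> .
ex:a1 a ex:Author ; ex:name "A. Rossi" .
ex:a2 a ex:Author ; ex:name "B. Verdi" .
ex:p1 a ex:Paper  ; ex:author ex:a1 .
"""

g = Graph()
g.parse(data=ttl, format="turtle")

q = """
SELECT ?class (COUNT(?s) AS ?instances)
WHERE { ?s a ?class }
GROUP BY ?class
ORDER BY DESC(?instances)
"""
for cls, n in g.query(q):
    print(cls, n)   # ex:Author 2, ex:Paper 1
```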

Marius Octavian Olaru

thesis presentation

2012/2013

Heterogeneous Data Warehouse Analysis and Dimensional Integration

The Data Warehouse (DW) is the main Business Intelligence instrument for the analysis of large amounts of operational data and for extracting strategic information in support of the decision-making process. It is usually focused on a specific area of an organization. Data Warehouse integration is the process of combining multidimensional information from two or more heterogeneous DWs and presenting users with a unified global overview of their combined strategic information. The problem is becoming more and more frequent, as the dynamic economic context sees many company mergers/acquisitions and the formation of new business networks, like co-opetition, where managers need to analyze all the involved parties and must be able to make strategic decisions concerning all the participants. The contribution of the thesis is to analyze heterogeneous DW environments and to present a dimension integration methodology that allows users to combine, access and query data from heterogeneous multidimensional sources. The integration methodology relies on graph theory and on the Combined Word Sense Disambiguation technique for generating semantic mappings between multidimensional schemas. Subsequently, schema heterogeneity is analyzed and handled, and compatible dimensions are made uniform by importing dimension categories from one dimension to another. This allows users from different sources to keep the same overview of the local data, and increases local schema compatibility for drill-across queries. The dimensional attributes are populated with instance values using a chase algorithm variant based on the RELEVANT clustering approach. Finally, several quality properties are discussed and analyzed. Dimension homogeneity/heterogeneity is presented from the integration perspective; the thesis also presents the theoretical conditions under which mapping quality properties (like coherency, soundness and consistency) are preserved. Furthermore, the integration methodology is analyzed when slowly changing dimensions are encountered.

Matteo Interlandi

2012/2013

On Declarative Data-Parallel Computation: Models, Languages and Semantics

If we put under analysis the plethora of large-scale data-processing tools available nowadays, we can recognize two main approaches: a declarative approach, pursued by parallel DBMSs and firmly grounded on relational model theory; and an imperative approach, followed by modern "MapReduce-like" data-processing systems, which are highly scalable, fault-tolerant, and mainly driven by industrial needs. Although there has been some work trying to bring the two worlds together, these works focus mainly on exporting languages and interfaces (i.e., declarative languages on top of imperative systems, or MapReduce-like functions over parallel DBMSs) or on a systematic merging of the features of the two approaches. We advocate that, instead, a declarative imperative approach should be attempted: that is, the development of a new computational model with a related language, based on relational theory and following the same patterns commonly present in modern data-processing systems, while maintaining a declarative flavor.
The goal of this thesis is then to carry out a first step in this direction. More concretely, we developed a new synchronous computational model for relational distributed parallel data processing, building on previous work on relational transducers and transducer networks. This computational model accepts declarative program specifications expressed in a version of Datalog¬ specifically tailored for parallel computation. Datalog¬ is a language lying in between logic and query languages and, thanks to its nature, not only can data-driven parallel computation be declaratively expressed, but the theoretical foundations connecting the semantics of programs with the emerging properties of their parallel execution can also be explored.

Abdul Rahman Dannaoui

thesis

2011/2012

Information Integration for biological data sources

This thesis focuses on data integration and data provenance in the context of the MOMIS data integration system, which was used to create the CEREALAB database. Its main contributions are the creation of the CEREALAB database V2.0, with new functionalities derived from the needs of the end users, and the study of different data provenance models, which led to a new component of the MOMIS system that offers data provenance support to CEREALAB users.


Serena Sorrentino

thesis presentation
2010/2011
Label Normalization and Lexical Annotation for Schema and Ontology Matching The goal of this thesis is to propose, and experimentally evaluate, automatic and semi-automatic methods performing label normalization and lexical annotation of schema labels. In this way, we may add sharable semantics to legacy data sources. Moreover, annotated labels are a powerful means to discover lexical relationships among structured and semi-structured data sources. Original methods to automatically normalize schema labels and extract lexical relationships have been developed, and their effectiveness for automatic schema matching is shown.
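As a toy example of what label normalization involves, the following Python sketch (an illustrative assumption, not the method proposed in the thesis) splits compound labels written in camelCase or snake_case and expands abbreviations through a small dictionary, producing words that can then be lexically annotated.

```python
# Minimal label-normalization sketch: split compound labels and expand
# abbreviations. The abbreviation dictionary and labels are made up.
import re

ABBREVIATIONS = {"qty": "quantity", "addr": "address", "cust": "customer"}

def normalize_label(label):
    # 1) split camelCase and snake_case into words
    words = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label).replace("_", " ").split()
    # 2) expand abbreviations and lowercase everything
    return [ABBREVIATIONS.get(w.lower(), w.lower()) for w in words]

print(normalize_label("custAddr"))    # ['customer', 'address']
print(normalize_label("order_qty"))   # ['order', 'quantity']
```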


Nana Mbinkeu Carlos

thesis 
2010/2011
Query Optimization and Quality-Driven Query Processing for Integration Systems This thesis focuses on some core aspects of data integration, i.e., query processing and data quality. First, it proposes new techniques for optimizing the full outer join operation, which is used in data integration systems for data fusion. Then, it demonstrates how to achieve Quality-Driven Query Processing, where quality constraints specified in Data Quality Aware Queries are used to perform query optimization.


Antonio Sala

thesis 
2009/2010
Data and Service Integration: Architectures and Applications to Real Domains This thesis focuses on Semantic Data Integration Systems, with particular attention to mediator-based approaches, to perform data and service integration. One of the topics of this thesis is the application of MOMIS to the bioinformatics domain, integrating different public databases to create an ontology of molecular and phenotypic cereal data. However, the main contribution of this thesis is a semantic approach to perform aggregated search of data and services. In particular, I describe a technique that, on the basis of an ontological representation of data and services related to a domain, supports the translation of a data query into a service discovery process; this technique has also been implemented as a MOMIS extension. This approach can be described as a "Service as Data" approach, as opposed to "Data as a Service" approaches: informative services are considered as a kind of source to be integrated with other data sources, to enhance the domain knowledge provided by a Global Schema of data. Finally, new technologies and approaches for data integration have been investigated, in particular distributed architectures, with the objective of providing a scalable architecture for data integration. An integration framework in a distributed environment is presented that allows realizing a data integration process on the cloud.


Laura Po

thesis
2008/2009
Automatic Lexical Annotation: an effective technique for dynamic data integration. The thesis shows how lexical annotation is a crucial element in data integration. Thanks to lexical annotation, new relationships are discovered between the elements of a schema or between elements of different schemas. Several methods to automatically annotate data sources are described and evaluated in different scenarios. Lexical annotation can also improve ontology matching systems: some experiments applying lexical annotation to the results of a matcher are presented. Finally, the probabilistic annotation approach is introduced and its application in dynamic integration processes is illustrated.
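As a minimal illustration of lexical annotation (not the annotation methods proposed in the thesis), the following Python sketch uses NLTK's WordNet interface to associate a schema label with its candidate synsets, from which relationships such as synonymy or hyponymy between schema elements can be derived.

```python
# Minimal lexical-annotation sketch: look up WordNet synsets for a label.
# Requires the WordNet corpus: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def annotate(word):
    """Return the candidate WordNet noun synsets for a schema label."""
    return wn.synsets(word, pos=wn.NOUN)

for syn in annotate("customer"):
    print(syn.name(), "-", syn.definition())

# Relationships between two labels can then be derived from their synsets:
# e.g. in WordNet 'customer' and 'client' share a synset, indicating synonymy.
```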


Mirko Orsini

thesis
2008/2009
Query Management in Data Integration Systems: the MOMIS approach. This thesis investigates the issue of Query Management in Data Integration Systems, taking into account several problems that have to be faced during the query processing phase. The goals achieved in the thesis are the study, analysis and proposal of techniques for effectively querying Data Integration Systems. The proposed techniques have been developed in the MOMIS Query Manager prototype to enable users to query an integrated schema and to provide them with a consistent and concise unified answer. The effectiveness of the MOMIS Query Manager prototype has been demonstrated by means of the THALIA testbed for Data Integration Systems; experimental results show how the MOMIS Query Manager can deal with all the queries of the benchmark.

A new kind of metadata that offers a synthesized view of an attribute's values, the relevant values, has been defined, and the effectiveness of such metadata for creating or refining a search query in a knowledge base is demonstrated by means of experimental results.

The security issues in Data integration/interoperation systems have been investigated and an innovative method to preserve data confidentiality and availability when querying integrated data has been proposed. A security framework for collaborative applications, in which the actions that users can perform are dynamically determined on the basis of their attribute values, has been presented, and the effectiveness of the framework has been demonstrated by an implemented prototype.

Gionata Gelati

thesis

2002/2003

Agent Technology Applied to Information Systems
The thesis is divided into three parts. In the first one, software agents are presented and critically compared to other mainstream technologies; modeling issues are also discussed. In the second part, some example systems to which we applied agent technology are presented and the solutions are discussed; the realistic scenarios and requirements for these systems were provided by the WINK and SEWASIE projects. The third part presents a logical framework for characterizing the interaction of software agents in virtual societies, where they may act as representatives of humans.


Francesco Guerra

thesis presentation
2002/2003

Dai Dati all'Informazione: il sistema MOMIS (From Data to Information: the MOMIS system)

The thesis introduces the methodology for the construction of a Global Virtual View of structured data sources implemented in the MOMIS system. In particular, the thesis focuses on the problem of the management and update of multi-language sources. Moreover, the thesis proposes a comparison between MOMIS and the main mediator systems available in the literature. Finally, some applications of the MOMIS system in the fields of the Semantic Web and e-commerce (developed within national and European projects) are presented.


Ilario Benetti

thesis
2001/2002
Knowledge Management for Electronic Commerce applications This work summarizes the activities carried out during my Ph.D. studies in Information Engineering. It is organized in two parts. The first part describes Knowledge Management Systems and their applications to Electronic Commerce; in particular, a technical and organizational overview of the most critical issues concerning Electronic Commerce applications is presented. This part is the result of a two-year research effort carried out in cooperation with Professor Enrico Scarso within the interdisciplinary (ICT and business organization) MIUR project "Il Commercio Elettronico: nuove opportunità e nuovi mercati per le PMI". The second part introduces the Intelligent Integration of Information (I3) research topic and presents the MOMIS system approach to I3. It outlines the theory underlying the MOMIS prototype and focuses on the generation of virtual catalogs in the electronic commerce environment, exploiting the SIDesigner component. A new MOMIS architecture, based on XML Web Services, is finally proposed. The new architecture not only addresses specific virtual catalog issues, but also leads to a general improvement of the MOMIS system.

Alberto Corni

thesis
1999/2000
Intelligent Information Integration: The MOMIS Project This thesis describes the work done during my Ph.D. studies in Computer Engineering. It is organized in two parts. The first and main part describes the MOMIS research project for the Intelligent Integration of heterogeneous information: it outlines the theory behind Intelligent Integration and the design and implementation of the prototype that implements the theoretical techniques. During my Ph.D. studies I stayed at Northeastern University in Boston, Mass. (USA); the second part of this document describes the work I did there with Professor Ken Baclawski on information retrieval, namely the annotation of documents using ontologies and the retrieval of the annotated documents.

Maurizio Vincini

pdf1 pdf2
1997/98
Utilizzo di tecniche di Intelligenza Artificiale nell'Integrazione di Sorgenti Informative Eterogenee (Using Artificial Intelligence Techniques for the Integration of Heterogeneous Information Sources) The doctoral thesis presents the MOMIS system (Mediator envirOnment for Multiple Information Sources) for the integration of structured and semi-structured data sources, following the source federation approach. The system provides the semi-automatic definition of a single integrated schema, exploiting the semantic information of each source schema (the term schema denotes the set of metadata describing a data repository).

Domenico Beneventano

thesis
1992/93
Uno Strumento di Inferenza nelle Basi di Dati ad Oggetti (Subsumption inference for Object-Oriented Data Models) Object-oriented data models are being extended with recursion to gain expressive power. This complicates both the incoherence detection problem, which has to deal with recursive class descriptions, and the optimization problem, which has to deal with recursive queries on complex objects. In this Ph.D. thesis, we propose a theoretical framework able to face the above problems. In particular, it is able to validate and automatically classify, in a database schema, (recursive) classes, views and queries organized in an inheritance taxonomy. The framework adopts the ODL formalism (an extension of the Description Logics developed in the area of Artificial Intelligence), which is able to express the semantics of complex object data models and to deal with cyclic references at the schema and instance level. It includes subsumption algorithms, which perform automatic placement in a specialization hierarchy of (recursive) views and queries, and incoherence algorithms, which detect incoherent (i.e., always empty) (recursive) classes, views and queries. As different styles of semantics (greatest fixed-point, least fixed-point and descriptive) can be adopted to interpret recursive views and queries, we first analyze and discuss the choice of one or another semantics and, secondly, give the subsumption and incoherence algorithms for the three different semantics. We show that subsumption computation and incoherence detection appear to be feasible, since in almost all practical cases they can be solved by polynomial-time algorithms. Finally, we show how subsumption computation is useful to perform semantic query optimization, which uses semantic knowledge (i.e., integrity constraints) to transform a query into an equivalent one that may be answered more efficiently.
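As a toy illustration of the subsumption idea (deliberately ignoring recursion and the different fixed-point semantics handled in the thesis), the following Python sketch checks structural subsumption between two simple class descriptions given as attribute-to-type constraints; the type hierarchy and descriptions are invented for the example.

```python
# Minimal structural-subsumption sketch (not the ODL algorithms of the thesis):
# a class description is a set of attribute constraints, and D1 subsumes D2
# when D2 constrains every attribute of D1 with an equal or more specific type.
SUBTYPES = {"student": {"person"}, "person": set()}   # toy type hierarchy

def is_subtype(t1, t2):
    """True if t1 is t2 or a declared subtype of t2."""
    return t1 == t2 or t2 in SUBTYPES.get(t1, set())

def subsumes(d1, d2):
    """D1 subsumes D2: every constraint of D1 is satisfied by D2."""
    return all(a in d2 and is_subtype(d2[a], t) for a, t in d1.items())

course_for_people   = {"attendee": "person"}
course_for_students = {"attendee": "student", "teacher": "person"}
print(subsumes(course_for_people, course_for_students))   # True: more general
print(subsumes(course_for_students, course_for_people))   # False
```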
The Ph.D. thesis is in Italian; its content can be found in the following two papers:
  • D. Beneventano, S. Bergamaschi, "Incoherence and Subsumption for recursive views and queries in Object-Oriented Data Models", Data & Knowledge Engineering 21 (1997), pp. 217-252, Elsevier Science B.V. (North-Holland). Abstract (ps), Paper (ps)
  • D. Beneventano, S. Bergamaschi, C. Sartori, "Description Logics for Semantic Query Optimization in Object-Oriented Database Systems", ACM Transactions on Database Systems 28(1): 1-50 (2003). Electronic Edition.