cif

Documentation

Introduction

"Cervelli in Fuga" is a project about young Italians and their choice to leave their home country. Emigration from Italy is often discussed in the media, but several factors need to be taken into account in order to understand what really lies behind the large number of young people choosing to live and work abroad.

The workflow was divided as follows:

Lucia Bertoldini - project ideation, legal analysis, visualization.
Giorgia Crosilla - project ideation, quality analysis, analysis of the datasets and creation of mash-up datasets, website development.
Gaia Ortona - project ideation, ethical and technical analysis, RDF assertion of the metadata.

Scenario

This project investigates the possible reasons behind many young people's choice to leave Italy. Our hypothesis is that the difficulty of gaining proper economic independence and the lack of meritocracy are the main factors that lead young Italians to leave their home country. To examine this hypothesis at the level of data, we considered 16 datasets related to graduate employment rates, house prices, hourly wages and work satisfaction. The time span taken into account goes from 2014 to 2022, as our aim is to give a full picture of young Italians' working conditions in the most recent years. Of course, this project does not aim to draw any conclusions on the matter: data alone is not enough to tell the full story, and the fact that these aspects might or might not be correlated does not imply anything in terms of causality. Moreover, the team selected the datasets on the basis of the hypothesis stated above: therefore, the choice of the datasets is not scientifically grounded. Within the sphere of interest, datasets were selected on the basis of availability and compatibility with one another. All these factors considered, it is once again necessary to state that this project is the elaboration of a hypothesis and does not aim to make definitive statements. Further research is needed to reach a conclusion, and more datasets need to be taken into account to cross-check the results.

Datasets

Sources of the original datasets:

ISTAT is the primary source of statistics in Italy, covering population, wages, environmental issues and social problems, all gathered through surveys and analyses. Since our focus is on young people and the reasons why they emigrate, we analyzed several datasets from this source, exploring hourly wages, house prices, the percentage of young people still living with their parents, work satisfaction and the percentage of young Italians who have emigrated.
ALMALAUREA is a consortium joined by 75 Italian universities and the Ministry of Education, Universities and Research, with the purpose of carrying out statistical studies on the Italian university system. This source has been used to extract information about the rate of employed graduates in Italy.

The datasets obtained from Istat have already been filtered at the source, considering common and necessary variables to be displayed within the datasets. On the other hand, for datasets obtained from AlmaLaurea, it was not possible to filter the necessary data at the source, so it was necessary to download the entire CSV files, provided year by year. Subsequently, especially for the latter, a cleaning process was carried out to select only the necessary data and merge the data together.

Analysis

Quality

The "National Guidelines for Enhancing Public Information Heritage", in particular considering the section related to Data quality, underline four among other characteristics that are necessary to point out when analyzing the goodness of data. This step is necessary in order to guarantee the "certification of the accuracy of the provided data and, above all, its suitability for the intended use." These four characteristics are:

Accuracy (syntactic and semantic)
Coherence
Completeness
Timeliness (or timeliness of updating)

In summary, accuracy and coherence are generally maintained for the data within the datasets. The same cannot be said for column headers, where syntactic inaccuracies are often found. Completeness is not guaranteed across the board: in some cases data for the year 2022 is missing, and none of the AlmaLaurea datasets come with metadata. Timeliness is not always ensured either, and in the case of AlmaLaurea in particular there is no information about the publication frequency of the data or the date of the last update.
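The completeness check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names `year` and `value`, the sample rows and the helper `completeness_report` are hypothetical and are not taken from the project's datasets or scripts.

```python
import csv
import io

# Hypothetical extract: column names and figures are placeholders,
# not values from the ISTAT or AlmaLaurea datasets.
SAMPLE_CSV = """year,value
2019,10.3
2020,
2021,10.6
"""


def completeness_report(csv_text, expected_years, year_col="year", value_col="value"):
    """Return (years absent from the file, years whose value cell is empty)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    present = {int(row[year_col]) for row in rows}
    missing_years = sorted(set(expected_years) - present)
    empty_values = sorted(
        int(row[year_col]) for row in rows if not (row[value_col] or "").strip()
    )
    return missing_years, empty_values


# Check the sample against the years 2019-2022.
missing, empty = completeness_report(SAMPLE_CSV, range(2019, 2023))
print("Years missing from the file:", missing)
print("Years with an empty value:", empty)
```

Running the same check on each dataset with the full 2014-2022 range would list the years for which a source provides no figure.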

Legal

Before releasing datasets to the public, it is necessary to investigate the legal framework that allows or restricts their publication.

The following checklist is an instrument to verify the legal aspects connected to the publication of data. On the basis of this checklist, we evaluated the following topics for each dataset: privacy issues, IPR policy, licenses, limitations on public access, economic conditions, and temporal aspects of the dataset.

Privacy: GDPR Regulation (EU) 2016/679, Regulation (EU) 2018/1807, Directive 2002/58/EC;
Open data directive, PSI: Directive (EU) 2019/1024
Copyright directive, CDSM: Directive (EU) 2019/790
INSPIRE: Directive 2007/2/EC that defines particular limitation on public access for spatial and geo data.

We performed the legal analysis directly on the platforms that originally hosted the datasets, in order to have a full picture of the legal status of these data. None of the original datasets analysed are owned by the creators of this project. However, the datasets used for the visualizations are the result of the mash-up between different datasets and are owned by "Cervelli in Fuga". All datasets processed and created during the mash-up phase are updated to January 2024. For complete and precise information, however, the user is invited to refer to the original datasets linked in the section related to the Quality Analysis.

Ethical

Data ethics involves the responsible and ethical handling of data throughout its lifecycle, from collection to processing, analysis, and dissemination.

For our analysis we took into consideration the main principles of data ethics (fairness, accountability, transparency, anonymity) based on the information provided by Data ethics: principles and guidelines for Companies, Authorities and Organizations and Data ethics framework by the UK government.

As far as we could see, no specific ethical issue was encountered, yet it is worth analysing each data source in detail:

ISTAT: As per the information available on ISTAT's official website, the institution is dedicated to upholding ethical and legal standards. Its core objective is to efficiently publish and communicate statistical information and analysis outcomes. The primary goal is to raise awareness regarding Italy's current conditions and contribute to informed decision-making among both private entities and public institutions. Additionally, ISTAT is actively engaged in methodological and applied research endeavors, aiming to enhance statistical production processes and promote a better understanding of statistics throughout Italy. As stated here, ISTAT makes their data completely accessible and reusable in compliance with the DECRETO-LEGGE 18 ottobre 2012, n. 179. The data are transparently managed and are anonymised in order to protect the privacy of the respondents. Regarding accountability, ISTAT's quality policy aligns with the European framework established by Eurostat. It adheres to the principles outlined in the European Statistics Code of Practice, which serves to enhance both accountability and governance within the European Statistical System and its constituent National Statistical Systems. This alignment underscores ISTAT's commitment to maintaining high standards and consistency in statistical practices, in harmony with broader European statistical guidelines.
ALMALAUREA: Since 1996, AlmaLaurea has annually captured the state of the Italian university system through two surveys focusing on graduates' academic profile and occupational status. The data are accessible and reusable for non-commercial purposes and they ensure anonymity for the respondents. Regarding transparency, AlmaLaurea is required by Article 5, paragraph 1 of Legislative Decree 33/2013 to publish documents, information, or data on its institutional website as stipulated by the regulations. This entails the right of anyone to request the same information.

Technical

Format

The data from Istat is made available in various versions for download, such as CSV, JSON, Excel, and SDMX-CSV. The latter is an open standard format as it has a formal specification (XML). The other formats are not open standards, but CSV, in particular, is a format used for the exchange of tabular data. On the other hand, data from AlmaLaurea is available for download only in CSV format.

Metadata

The "National Guidelines for Enhancing Public Information Heritage", in particular considering the section related to Metadata Model, highlight four distinct levels of metadata that progressively convey the degree of connection between data and metadata, along with the level of detail.
The metadata from Istat has a dual description that places them at both level 4 and level 2. Level 2 represents a weak bond between data and metadata and it is represented by external metadata accessible from the SIqual portal (cases where this information can be accessed are reported in the "timeliness" section of the "Quality analysis" table). Level 4 is represented by the data contained in SDMX format files that link metadata to each individual data point. This combination allows for metadata that is comprehensible both to humans and machines. Datasets from AlmaLaurea do not include metadata, so they fall into level 1.

Licenses

The preliminary technical analysis identified the following licenses for the original datasets:

ISTAT data are made available under the CC-BY-3.0 license, as indicated here. This license allows for "reproduction, distribution, transmission, and adaptation of data and analyses from the National Institute of Statistics, even for commercial purposes, provided that the source is cited."
On the other hand, in the case of AlmaLaurea, there is no explicit reference to the type of license used, with the only indication being "Unless otherwise indicated, reproduction for non-commercial purposes with citation of the source is authorized".

Considering the two different licenses used for the original datasets, we had to find common ground when choosing the license for publishing the mash-up datasets. Datasets I2, I3, I5 are entirely published using CC-BY-SA-4.0, while I4 is published combining CC-BY-SA-4.0 and CC-BY-NC-4.0 (see here). The second license is used exclusively for data originating from AlmaLaurea, which has a license that does not permit commercial distribution.

Other licenses used are:

The Python scripts used for the cleaning and mash-up process and the GitHub repository are published under the MIT License.
Metadata are published using CC-BY-SA-4.0.
The project is published under the license CC-BY-SA-4.0.

Sustainability of the update of the datasets over time

All data and information available on this website have been published under the CC BY-SA 4.0 and CC BY-NC 4.0 licenses and are compliant with the FAIR principles.

Findable
The first step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automatic discovery of datasets and services, so this is an essential component of the FAIRification process.
The website is readily discoverable and accurately cataloged by search engines, ensuring optimal visibility. Additionally, every dataset generated for the conducted analysis is distinctly identified, enhancing traceability and accountability throughout the research process. This meticulous approach not only facilitates easy retrieval of information but also ensures the authenticity and uniqueness of each dataset involved in the analysis.

Accessible
Once the user finds the required data, she/he/they need to know how they can be accessed, possibly including authentication and authorisation.
All gathered data is conveniently accessible both on this platform and within the corresponding GitHub Repository. Importantly, these datasets will persistently remain available and downloadable, even in the event that the original sources decide to remove them in the future. This commitment to sustained accessibility underscores our dedication to preserving and sharing valuable information, fostering transparency, and enabling continued utilization for research and analytical purposes.

Interoperable
The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.
The combined datasets are meticulously characterized in accordance with the metadata guidelines outlined in the DCAT-AP version 2.0.0 specifications. These guidelines are meticulously designed to ensure interoperability with other widely recognized and utilized formats such as SKOS, FOAF, PROV-O, and DCTerms. This systematic adherence to established standards not only enhances the coherence and structure of the metadata but also facilitates seamless integration with diverse platforms and tools, promoting broader accessibility and utilization of the mashed-up datasets across different domains.

Reusable
The ultimate goal of FAIR is to optimise the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings.
The information presented on the website is available for unrestricted reuse, subject to compliance with the specified output license. Users are encouraged to freely utilize and repurpose the data in accordance with the terms outlined in the license agreement. This ensures that individuals and entities have the freedom to apply, share, and adapt the published data, fostering a collaborative and open approach to the dissemination of information.

As of the finalization of our project for the course "Open Access and Digital Ethics" at the University of Bologna, for the academic year 2023/2024, there are no plans for future updates to the resource we have developed. It is essential to note that the original datasets we have analyzed are regularly updated on an annual basis by their source, establishing it as the most dependable information repository in this domain. Therefore, we recommend that interested users refer to the platform directly for any subsequent changes or updates to the data in the coming years.

Mash up

The dataset mash-up was carried out in two phases: the creation of intermediate datasets and then the creation of the final mashed datasets.

Intermediate datasets.

The creation of intermediate datasets was necessary to facilitate the subsequent combination of datasets, particularly considering the initial data from AlmaLaurea. As mentioned earlier, it was necessary to download the datasets in their entirety from the source, as filtering was not possible. Additionally, the CSV files were not formatted correctly, so a cleaning process was required in which only one specific row of each dataset was kept: the one related to the employment of individuals 3 years after obtaining a master's degree. In this case, the intermediate dataset allows the consolidation of the previous 9 datasets into a single dataset, bringing together all the data collected over the 2014-2021 timeframe.
The second intermediate dataset combines data on house price indices and the percentage of young people still living with their parents. This dataset was created as an indicator of economic independence and is necessary for conducting the final analysis presented in the last dataset mash-up.
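The consolidation of the yearly AlmaLaurea files into one intermediate dataset can be sketched as follows. This is not the project's actual cleaning script: the file layout, the row label `Employment rate at 3 years`, the figures and the helper `consolidate` are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical yearly exports: layout, row labels and figures are placeholders,
# not the actual AlmaLaurea files.
YEARLY_FILES = {
    2014: "indicator,value\nEmployment rate at 1 year,70.0\nEmployment rate at 3 years,80.0\n",
    2015: "indicator,value\nEmployment rate at 1 year,71.0\nEmployment rate at 3 years,81.5\n",
}

TARGET_ROW = "Employment rate at 3 years"  # assumed label of the row to keep


def consolidate(yearly_files, target_row):
    """Keep the target row of each yearly file and stack the results by year."""
    combined = []
    for year in sorted(yearly_files):
        reader = csv.DictReader(io.StringIO(yearly_files[year]))
        for row in reader:
            if row["indicator"].strip() == target_row:
                combined.append({"year": year, "employment_rate": float(row["value"])})
                break  # only one matching row is expected per file
    return combined


intermediate = consolidate(YEARLY_FILES, TARGET_ROW)
for record in intermediate:
    print(record)
```

Keeping the year as an explicit column makes the result straightforward to combine with other year-indexed sources later on.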

Mash-up datasets.

The process of creation of the four different datasets is explained in the Jupyter Notebook using both text and code.

I2 : Working status by age This dataset merges data related to the satisfaction of young people at work and the gross hourly wage. The goal is to understand whether there is a correlation between wage and satisfaction at work, and whether this could be a parameter that influences emigration to foreign countries. It should be noted that the age span analyzed in D4 is 15-24, while D6 considers an age bracket of 15-29 years. Even though the two age spans do not coincide, we still considered them as representative of young people, and since no alignment of this data was possible, we chose to work with these different age spans. Furthermore, D6 does not contain data for 2022; for this reason this mash-up dataset takes into consideration only data from 2014 to 2021.
I3 : Work wages across sectors This dataset combines data related to the average salary based on type of contract, age of the employees and educational qualification. The final dataset takes into consideration not only the temporal span, but also the Italian geographical macro-areas (North-west, North-east, Center and South) and the economic activity. In all three starting datasets only data for 2014-2021 was available.
I4 : Graduates' employment status This dataset focuses mainly on the graduates' working conditions, in particular considering their level of satisfaction at work, the employment rate and the average wage. This analysis is quite general, since neither geographical subdivisions nor economic activities are taken into consideration. This is due to the fact that this information was not present in D4 and D17. Moreover, the age span considered in this analysis is not consistent:
- D4 presents an age span of 15-24.
- D17 considers the employment rate of graduates 3 years after graduation. No data about their age is present.
- D7 considers only hourly wage by educational qualification, with no data related to age.
Although we had considered eliminating the age range in D4, removing this variable would have meant not paying proper attention to young people. D4 does not provide any information regarding an age group considered as "youth" different from the one indicated, so the chosen age range has been retained. D17 provides information related to the 3 years following the completion of a master's degree. According to AlmaLaurea data, on average, obtaining a master's degree occurs around the age of 27, so the reference to these data would be relevant to 30 years old graduates. The hourly salary based on the level of education is considered as the median, allowing for the analysis of the central position relative to the ordered data and is not sensitive to the presence of outliers. Even in this case the temporal span considered for this analysis is 2014-2021 since D7 contained only data related to this time span.
I5 : Emigration and economic independence This dataset contains the number of young people emigrated from Italy, indices of house prices and number of young people still living at home with their parents from 2014 to 2022.
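Several of the mash-ups above are limited to the years available in every source (for example 2014-2021 when one source has no 2022 data). A minimal sketch of such a join on shared years is given below; the series names, the values and the helper `merge_on_common_years` are hypothetical and do not reproduce the code in the project's Jupyter Notebook.

```python
# Hypothetical year-indexed series; the values are placeholders.
satisfaction = {2020: 7.1, 2021: 7.2, 2022: 7.3}  # a source that reaches 2022
hourly_wage = {2019: 10.0, 2020: 10.1, 2021: 10.2}  # a source that stops at 2021


def merge_on_common_years(**series):
    """Inner-join year-indexed series, keeping only the years present in all of them."""
    common = set.intersection(*(set(values) for values in series.values()))
    return [
        {"year": year, **{name: values[year] for name, values in series.items()}}
        for year in sorted(common)
    ]


merged = merge_on_common_years(satisfaction=satisfaction, hourly_wage=hourly_wage)
for row in merged:
    print(row)
```

An inner join of this kind drops years silently, so it is worth recording which years were excluded and why, as the dataset descriptions above do.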

Visualization

Data visualization is a key step in the process of turning raw data into information understandable to everyone. That is why we turned the main datasets resulting from the mash-up into charts. During this process, we encountered some difficulties, mainly because the same dataset could contain data expressed in different measurement units (percentages, numbers of people, euros). We solved the problem by making sure that data in different units did not mix in the visualization (for example, by sharing the same y axis). Out of four visualizations, three are charts that show the evolution over time of a specific phenomenon. For the fourth visualization, we chose to show the differences by highlighting the role played by the geographical area in various phenomena.

Visualization tools

D3.js: a JavaScript library for manipulating documents based on data, which uses HTML, SVG and CSS; and combines powerful visualization components and a data-driven approach to DOM manipulation;
Flourish: a software for data visualization that allows graphic manipulation and the creation of interactive charts and maps.

RDF assertion of the metadata

The metadata of the mashed-up datasets have been described following DCAT-AP version 2.0.0, an RDF vocabulary designed to facilitate interoperability between data catalogs published on the Web. The resulting Turtle serialization can be accessed by selecting the item of interest in the following list:

Mashup dataset I2

Mashup dataset I3

Mashup dataset I4

Mashup dataset I5

Cif Catalogue

The datasets have been described both individually, each as a dcat:Dataset, and as a whole, as a dcat:Catalog gathering them all. The main metadata properties used are summarized in the next table, with particular attention to including both the mandatory elements and as many recommended and optional properties as possible to meaningfully enrich the collection.

	Datasets	Catalogue
Identifiers	adms:identifier, dcterms:title	adms:identifier, dcterms:title
Description	dcterms:description, dcat:keyword	dcterms:description, dcat:keyword
Temporal	dcterms:issued, dcterms:modified	dcterms:issued, dcterms:modified
Spatial	prov:wasDerivedFrom
Composition	dcterms:hasPart	dcat:dataset
Agents	dcterms:publisher, dcterms:creator	dcterms:publisher, dcterms:creator
Legal	dcterms:rightsHolder, dcterms:license	dcterms:license
Distribution	dcat:distribution
Language	dcterms:language	dcterms:language
Web	dcat:landingPage	foaf:homepage
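To make the table more concrete, the sketch below builds a small Turtle description of a dcat:Dataset using a few of the properties listed. It is an illustration only: the `example.org` URIs, the titles, the issue date and the helpers `literal` and `dataset_to_turtle` are placeholders and are not taken from the project's published metadata. Only plain string formatting is used, so literals containing line breaks are not handled.

```python
# Namespace prefixes for the vocabularies used in this sketch.
PREFIXES = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dcterms": "http://purl.org/dc/terms/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def literal(text, lang=None, datatype=None):
    """Format a Turtle literal, optionally with a language tag or a datatype."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    if lang:
        return f'"{escaped}"@{lang}'
    if datatype:
        return f'"{escaped}"^^{datatype}'
    return f'"{escaped}"'


def dataset_to_turtle(uri, properties):
    """Serialise (predicate, object) pairs as one dcat:Dataset description."""
    lines = [f"@prefix {prefix}: <{ns}> ." for prefix, ns in PREFIXES.items()]
    lines.append("")
    lines.append(f"<{uri}> a dcat:Dataset ;")
    statements = [f"    {predicate} {obj}" for predicate, obj in properties]
    lines.append(" ;\n".join(statements) + " .")
    return "\n".join(lines)


# Every value below is a placeholder chosen for illustration.
turtle = dataset_to_turtle(
    "https://example.org/cif/dataset/example",
    [
        ("dcterms:title", literal("Example mash-up dataset", lang="en")),
        ("dcterms:description", literal("Placeholder description.", lang="en")),
        ("dcat:keyword", literal("emigration", lang="en")),
        ("dcterms:issued", literal("2024-01-01", datatype="xsd:date")),
        ("dcterms:license", "<https://creativecommons.org/licenses/by-sa/4.0/>"),
        ("dcat:distribution", "<https://example.org/cif/dataset/example/csv>"),
    ],
)
print(turtle)
```

For real metadata, an RDF library that validates and serialises the graph would be a safer choice than string formatting.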

Conclusion

We conducted research on the status of young people in Italy with regard to their working conditions and satisfaction, in order to investigate the reasons that might be behind the high level of emigration from this country. Our approach involved gathering and analyzing data from various sources, such as Istat (Istituto Nazionale di Statistica) and AlmaLaurea. By combining and examining data from diverse origins, we aimed to gain a thorough understanding of this issue.

To address the competency questions presented at the start of the project:

Considering the available data, it is not possible to assert with certainty that youth emigration depends on the factors we have analyzed. According to these data, young people seem to be quite satisfied with their jobs, even though their hourly wages have increased by only a few cents over the 8 years analyzed. Regarding salaries, discrepancies between wages in Northern Italy, especially the Northwest, and Central-Southern Italy are consistently observed. Furthermore, in the accommodation, education, human health, arts and entertainment sectors, there appears to be little difference in salary among young workers, permanent employees, and those with different educational qualifications. The data indicate that the salary consistently stays around 10-11 euros per hour. Moreover, between 2014 and 2022, the correlation between young people living at home, housing prices, and emigration seems to have increased, with the highest peaks noted in the Northwest, where housing prices have risen significantly, and in Southern Italy, where a high number of emigrants and young people still living with their families are observed.
Regarding graduates, the analysis reveals an increase in employment along with a rise in job satisfaction (fairly satisfied), with an average salary level that remains relatively stagnant. However, it is not clear from the collected data whether workers are employed in a field consistent with their area of study. Therefore, we cannot find a correlation with the provided data, especially when searching for the motivations behind young people's migration.

This analysis cannot be considered complete because most data related to these aspects are actually not trackable and therefore not available due to phenomena such as unreported employment. Considering only the available data, the analysis could be made more comprehensive by normalizing the data to facilitate further correlation calculations. For the purpose of the project, we have decided to propose only a graphical representation that still allows for a straightforward visualization of trends over time and possible correlations. Additionally, the project could be expanded by considering additional parameters, and especially by comparing the Italian work situation with that abroad, particularly in Europe. This approach could highlight differences in salary and working conditions that drive young Italians to emigrate.
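The normalization and correlation step mentioned above as a possible extension could look like the following sketch. It was not part of the project: the two series and their values are invented placeholders, and min-max scaling and the Pearson coefficient are just one possible choice of methods.

```python
import math


def min_max(values):
    """Rescale a series to the [0, 1] range (all zeros if the series is constant)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Placeholder yearly values in different units (an index and a head count).
house_price_index = [100.0, 101.5, 103.0, 108.0]
emigrants = [40000, 43000, 42000, 50000]

r_raw = pearson(house_price_index, emigrants)
r_scaled = pearson(min_max(house_price_index), min_max(emigrants))
print("Pearson r on raw values:", r_raw)
print("Pearson r on min-max scaled values:", r_scaled)
```

Min-max scaling does not change the Pearson coefficient, so its main benefit would be placing series in different units on a common scale for comparison and plotting. With only a handful of yearly points, any coefficient should be read with caution, and it says nothing about causality.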