"Cervelli in Fuga" is a project on young Italian people and their choice of leaving their home country. Often discussed in the media, the problem of emigration in Italy has various factors that need to be taken into account in order to understand what really lies behind the big number of young people choosing to live and work abroad.
The workflow was divided as follows:
This project aims to investigate the possible reasons behind many young people's choice to leave Italy. Our hypothesis is that the difficulty of gaining proper economic independence and the lack of meritocracy are the main factors that lead young Italians to leave their home country. To examine this hypothesis against data, we considered 16 datasets related to graduate employment rate, house prices, hourly wage and work satisfaction. The time span taken into account goes from 2014 to 2022, as our aim is to give a full picture of young Italians' working conditions in the most recent years. Of course, this project does not aim to draw any conclusions on the matter: data alone is not enough to tell the full story, and the fact that these aspects might or might not be correlated does not imply anything in terms of causality. Moreover, the project team selected the datasets on the basis of the hypothesis stated above: therefore, the choice of the datasets is not scientifically grounded. Within the sphere of interest, datasets were selected on the basis of availability and compatibility with one another. All these factors considered, it is once again necessary to state that this project is the elaboration of a hypothesis and does not aim to make definitive statements. Further research is needed in order to reach a conclusion, and more datasets need to be taken into account to cross-check the results.
The datasets obtained from Istat have already been filtered at the source, considering common and necessary variables to be displayed within the datasets. On the other hand, for datasets obtained from AlmaLaurea, it was not possible to filter the necessary data at the source, so it was necessary to download the entire CSV files, provided year by year. Subsequently, especially for the latter, a cleaning process was carried out to select only the necessary data and merge the data together.
The "National Guidelines for Enhancing Public Information Heritage", in particular the section on Data quality, highlight four characteristics, among others, that must be addressed when assessing the quality of data. This step is necessary in order to guarantee the "certification of the accuracy of the provided data and, above all, its suitability for the intended use." These four characteristics are:
In summary, accuracy and consistency are more or less consistently maintained regarding the data within the dataset. The same cannot be said for column headers, where syntactic inaccuracies are often found. Completeness is not guaranteed across the board, as in some cases, there is missing data for the year 2022, and all datasets from AlmaLaurea lack metadata related to the datasets. Timeliness is not always adhered to, and particularly in the case of AlmaLaurea, there is no information regarding the data publication frequency or, specifically, the last update.
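The completeness issue noted above (missing values for some years) can be checked mechanically. The following is a minimal sketch in Python, assuming a hypothetical table with a `year` column and a `value` column; the sample rows and the `missing_years` helper are invented for illustration and are not taken from the project's data or notebook.

```python
import csv
import io

# Hypothetical sample, not project data. The empty cell for 2022 mimics a gap.
SAMPLE_CSV = """year,value
2020,10.5
2021,11.0
2022,
"""


def missing_years(csv_text, first_year, last_year,
                  year_col="year", value_col="value"):
    """Return the years in [first_year, last_year] that have no usable value."""
    present = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # A year counts as present only if its value cell is non-empty.
        if (row.get(value_col) or "").strip():
            present.add(int(row[year_col]))
    return [y for y in range(first_year, last_year + 1) if y not in present]


# Years in the requested range without a value in the sample.
print(missing_years(SAMPLE_CSV, 2020, 2022))
```

Running the same check over every dataset would give a quick overview of which years are incomplete before any mash-up is attempted.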
Before releasing datasets to the public, it is necessary to investigate the legal framework that allows or restricts their publication.
The following checklist is an instrument to verify the legal aspects connected to the publication of data. Based on this checklist, for each dataset we evaluated the following topics: privacy issues, IPR policy, licenses, limitations on public access, economic conditions, and temporal aspects of the dataset.
We performed the legal analysis directly on the platforms that initially hosted the datasets, in order to have a full picture of the legal status of these data. None of the datasets analysed are owned by the creators of this project. However, the datasets used for the visualizations are the result of the mash-up between different datasets and are owned by "Cervelli in Fuga". All datasets processed and created during the mash-up phase are updated as of January 2024. For complete and precise information, however, the user is invited to refer to the original datasets linked in the section related to the Quality Analysis.
Data ethics involves the responsible and ethical handling of data throughout its lifecycle, from collection to processing, analysis, and dissemination.
For our analysis we took into consideration the main principles of data ethics (fairness, accountability, transparency, anonymity) based on the information provided by Data ethics: principles and guidelines for Companies, Authorities and Organizations and Data ethics framework by the UK government.
As far as we could see, no specific ethical issue was encountered, yet it is worth analysing each data source in detail:
The data from Istat is made available in various versions for download, such as CSV, JSON, Excel, and SDMX-CSV. The latter is an open standard format as it has a formal specification (XML). The other formats are not open standards, but CSV, in particular, is a format used for the exchange of tabular data. On the other hand, data from AlmaLaurea is available for download only in CSV format.
The "National Guidelines for Enhancing Public Information
Heritage", in particular considering the section related to
Metadata Model, highlight four distinct levels of metadata that progressively
convey the degree of connection between data and metadata, along
with the level of detail.
The metadata from Istat has a dual description that places them
at both level 4 and level 2. Level 2 represents a weak bond
between data and metadata and it is represented by external
metadata accessible from the SIqual portal (cases where this
information can be accessed are reported in the "timeliness"
section of the "Quality analysis" table). Level 4 is represented
by the data contained in SDMX format files that link metadata to
each individual data point. This combination allows for metadata
that is comprehensible both to humans and machines. Datasets
from AlmaLaurea do not include metadata, so they fall into level
1.
The preliminary technical analysis identified the following licenses in the original datasets:
Considering the two different licenses used for the original datasets, we had to find common ground when choosing the license for publishing the mash-up datasets. Datasets I2, I3, I5 are entirely published using CC-BY-SA-4.0, while I4 is published combining CC-BY-SA-4.0 and CC-BY-NC-4.0 (see here). The second license is used exclusively for data originating from AlmaLaurea, which has a license that does not permit commercial distribution.
Other licenses used are:
All data and information available on this website have been published under the CC BY-SA 4.0 and CC BY-NC 4.0 licenses and are compliant with the FAIR principles.
The website is readily discoverable and accurately cataloged by search engines, ensuring optimal visibility. Additionally, every dataset generated for the conducted analysis is distinctly identified, enhancing traceability and accountability throughout the research process. This meticulous approach not only facilitates easy retrieval of information but also ensures the authenticity and uniqueness of each dataset involved in the analysis.
All gathered data is conveniently accessible both on this platform and within the corresponding GitHub Repository. Importantly, these datasets will persistently remain available and downloadable, even in the event that the original sources decide to remove them in the future. This commitment to sustained accessibility underscores our dedication to preserving and sharing valuable information, fostering transparency, and enabling continued utilization for research and analytical purposes.
The combined datasets are described in accordance with the metadata guidelines outlined in the DCAT-AP version 2.0.0 specifications. These guidelines are designed to ensure interoperability with other widely used vocabularies such as SKOS, FOAF, PROV-O, and DCTerms. Adhering to these established standards gives the metadata a coherent structure and facilitates integration with diverse platforms and tools, promoting broader accessibility and reuse of the mashed-up datasets across different domains.
The information presented on the website is available for unrestricted reuse, subject to compliance with the specified output license. Users are encouraged to freely utilize and repurpose the data in accordance with the terms outlined in the license agreement. This ensures that individuals and entities have the freedom to apply, share, and adapt the published data, fostering a collaborative and open approach to the dissemination of information.
As of the finalization of our project for the course "Open Access and Digital Ethics" at the University of Bologna, for the academic year 2023/2024, there are no plans for future updates to the resource we have developed. It is essential to note that the original datasets we have analyzed are regularly updated on an annual basis by their source, establishing it as the most dependable information repository in this domain. Therefore, we recommend that interested users refer to the platform directly for any subsequent changes or updates to the data in the coming years.
The dataset mash-up was carried out in two phases: the creation of intermediate datasets and then the creation of the final mashed datasets.
The creation of intermediate datasets was necessary to
facilitate the subsequent combination of datasets, particularly
considering the initial data from AlmaLaurea. As mentioned
earlier, it was necessary to download the datasets in their
entirety from the source, as filtering was not possible.
Additionally, the CSV file was not formatted correctly,
requiring a cleaning process in which only a specific row of the
dataset was kept: the one related to the employment of individuals
3 years after obtaining a master's degree. In this case, the intermediate
dataset allows the consolidation of the previous 9 datasets into
a single dataset, bringing together all the data collected over
the 2014-2021 timeframe.
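The consolidation described above can be sketched as follows. This is an illustrative example only: the file layout, the `indicator` and `value` column names, the sample figures and the `consolidate` helper are all assumptions, not AlmaLaurea's real format or the code used in the project's notebook.

```python
import csv
import io

# Hypothetical yearly exports (not AlmaLaurea's real layout): each file holds
# many indicator rows, of which only one is needed.
YEARLY_FILES = {
    2014: "indicator,value\nEmployment rate,70.0\nOther indicator,1.0\n",
    2015: "indicator,value\nEmployment rate,71.5\nOther indicator,2.0\n",
}


def consolidate(files_by_year, wanted_indicator):
    """Keep one indicator row per yearly CSV and stack them into one table."""
    rows = []
    for year in sorted(files_by_year):
        reader = csv.DictReader(io.StringIO(files_by_year[year]))
        for row in reader:
            if row["indicator"].strip() == wanted_indicator:
                rows.append({"year": year, "value": float(row["value"])})
                break  # only one matching row is expected per file
    return rows


table = consolidate(YEARLY_FILES, "Employment rate")
for record in table:
    print(record["year"], record["value"])
```

Adding the year as an explicit column is what turns several single-year files into one time series that can later be joined with other datasets.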
The second intermediate dataset combines data on house price
indices and the percentage of young people still living with
their parents. This dataset was created as an indicator of
economic independence and is necessary for conducting the final
analysis presented in the last dataset mash-up.
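Combining two indicators of this kind amounts to a join on the year they share. A minimal sketch, assuming each indicator has already been reduced to a year-to-value mapping; the `join_on_year` helper, the column names and the numbers below are invented for illustration and are not project data.

```python
def join_on_year(house_prices, living_with_parents):
    """Inner-join two {year: value} mappings, keeping only the shared years."""
    shared_years = sorted(house_prices.keys() & living_with_parents.keys())
    return [
        {
            "year": year,
            "house_price_index": house_prices[year],
            "living_with_parents_pct": living_with_parents[year],
        }
        for year in shared_years
    ]


# Invented values, for illustration only.
house_price_index = {2019: 100.0, 2020: 102.0, 2021: 104.0}
living_with_parents_pct = {2020: 60.0, 2021: 61.0, 2022: 62.0}

merged = join_on_year(house_price_index, living_with_parents_pct)
for row in merged:
    print(row)
```

An inner join is used here so that a year missing from either source is dropped rather than filled with a guessed value.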
The process of creation of the four different datasets is explained in the Jupyter Notebook using both text and code.
Data visualization is a key step in the process of turning raw data into information understandable to everyone. That is why we processed our main datasets resulting from the mash-up into visualization charts. During this process, we encountered some difficulties, primarily because the same datasets contained data expressed in different units of measurement (percentages, units of people, euros). The problem was solved by making sure that data expressed in different units were not mixed in the same visualization (for example, on the same y axis). Out of four visualizations, three are charts that aim to show the evolution over time of a specific phenomenon. For the fourth visualization, we chose to show the differences by highlighting the role played by the geographical area in various phenomena.
Visualization tools
The produced mashed-up datasets have been described in their metadata specification following DCAT AP version 2.0.0, an RDF vocabulary designed to facilitate interoperability between data-catalogs published on the Web, and the resulting Turtle serialization can be accessed by selecting the item of interest in the following list:
The datasets have been described both individually, each as a dcat:Dataset, and as a whole, as a dcat:Catalog gathering them all. The main metadata properties used are summarized in the next table, with specific attention to including both the mandatory elements and as many recommended and optional properties as possible to meaningfully enrich the collection.
Metadata | Datasets | Catalogue |
---|---|---|
Identifiers | adms:identifier, dcterms:title | adms:identifier, dcterms:title |
Description | dcterms:description, dcat:keyword | dcterms:description, dcat:keyword |
Temporal | dcterms:issued, dcterms:modified | dcterms:issued, dcterms:modified |
Spatial | prov:wasDerivedFrom | |
Composition | dcterms:hasPart | dcat:dataset |
Agents | dcterms:publisher, dcterms:creator | dcterms:publisher, dcterms:creator |
Legal | dcterms:rightsHolder, dcterms:license | dcterms:license |
Distribution | dcat:distribution | |
Language | dcterms:language | dcterms:language |
Web | dcat:landingPage | foaf:homepage |
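To give an idea of what a description built from a few of these properties looks like in Turtle, here is a minimal sketch in Python that uses only string formatting. The dataset URI, title, description and keywords are placeholders, and the `dataset_to_turtle` helper is hypothetical; it is not the project's actual serialization, and an RDF library would normally be preferable for real use.

```python
def turtle_literal(text, lang="en"):
    """Quote a string as a language-tagged Turtle literal."""
    escaped = (
        text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'"{escaped}"@{lang}'


def dataset_to_turtle(uri, title, description, keywords, license_uri):
    """Render a small dcat:Dataset description as a Turtle string."""
    if not keywords:
        raise ValueError("at least one keyword is required in this sketch")
    keyword_list = ", ".join(turtle_literal(k) for k in keywords)
    return "\n".join([
        "@prefix dcat: <http://www.w3.org/ns/dcat#> .",
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "",
        f"<{uri}> a dcat:Dataset ;",
        f"    dcterms:title {turtle_literal(title)} ;",
        f"    dcterms:description {turtle_literal(description)} ;",
        f"    dcat:keyword {keyword_list} ;",
        f"    dcterms:license <{license_uri}> .",
    ])


# Placeholder values: the URI and texts below are hypothetical.
ttl = dataset_to_turtle(
    uri="https://example.org/dataset/example-mashup",
    title="Example mashed-up dataset",
    description="Placeholder description of a mashed-up dataset.",
    keywords=["emigration", "employment"],
    license_uri="https://creativecommons.org/licenses/by-sa/4.0/",
)
print(ttl)
```

The sketch covers only title, description, keywords and license; the remaining properties in the table follow the same subject, predicate, object pattern.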
We've conducted thorough research on the status of young people in Italy with regard to their working conditions and satisfaction, in order to investigate the reasons that might be behind the high level of emigration from this country. Our approach involved gathering and analyzing data from various sources, such as Istat (Istituto Nazionale di Statistica) and AlmaLaurea. By combining and examining data from diverse origins, we aimed to gain a comprehensive understanding of this issue.
To address the competency questions presented at the start of the project:
This analysis cannot be considered complete because most data related to these aspects are actually not trackable and therefore not available due to phenomena such as unreported employment. Considering only the available data, the analysis could be made more comprehensive by normalizing the data to facilitate further correlation calculations. For the purpose of the project, we have decided to propose only a graphical representation that still allows for a straightforward visualization of trends over time and possible correlations. Additionally, the project could be expanded by considering additional parameters, and especially by comparing the Italian work situation with that abroad, particularly in Europe. This approach could highlight differences in salary and working conditions that drive young Italians to emigrate.
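The normalization step suggested above could look like the following sketch, which is not part of the project. It standardizes two yearly series to z-scores, so that indicators in different units (percentages, euros, index points) become comparable, and then computes the Pearson correlation as the mean product of the z-scores. The function names and the sample series are assumptions made for illustration.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a series to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("a constant series cannot be standardized")
    return [(v - mu) / sigma for v in values]


def pearson(xs, ys):
    """Pearson correlation, computed as the mean product of z-scores."""
    if len(xs) != len(ys):
        raise ValueError("both series must cover the same years")
    zx, zy = z_scores(xs), z_scores(ys)
    return mean(a * b for a, b in zip(zx, zy))


# Invented series for illustration only (not project data).
series_a = [1.0, 2.0, 3.0, 4.0]
series_b = [2.0, 4.0, 6.0, 8.0]
print(round(pearson(series_a, series_b), 6))
```

As stated in the scope of the project, a coefficient obtained this way would describe how two series move together over the observed years and would not establish causality.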