Glitter or Gold? Deriving Structured Insights from Sustainability Reports via Large Language Models

Over the last decade, several regulatory bodies have started requiring the disclosure of non-financial information from publicly listed companies, in light of investors' increasing attention to Environmental, Social, and Governance (ESG) issues. Such information is publicly released in a variety of non-structured and multi-modal documentation. Hence, it is not straightforward to aggregate and consolidate such data in a cohesive framework to further derive insights about sustainability practices across companies and markets. Thus, it is natural to resort to Information Extraction (IE) techniques to provide concise, informative and actionable data to the stakeholders. Moving beyond traditional text processing techniques, in this work we leverage Large Language Models (LLMs), along with prominent approaches such as Retrieval Augmented Generation and in-context learning, to extract semantically structured information from sustainability reports. We then adopt graph-based representations to generate meaningful statistical, similarity and correlation analyses concerning the obtained findings, highlighting the prominent sustainability actions undertaken across industries and discussing emerging similarity and disclosure patterns at company, sector and region levels. Lastly, we investigate which factual aspects most affect companies' ESG scores using our findings and other company information.


Introduction
Public health, climate change, social inequalities, diversity, and inclusiveness are challenges that need global attention as well as innovative and collaborative solutions. However, building a sustainable society requires defining a common set of sustainability-related issues to disclose, measure and comply with. ESG, which stands for Environmental, Social, and Governance, is an established set of principles used to monitor the sustainability and ethical practices of businesses within society. These three E/S/G aspects are further described via more granular indicators (both qualitative and quantitative) concerning, for example, waste management, emissions, labour rights, and diversity. These indicators aid in evaluating the degree to which a corporation contributes to achieving societal goals. Assessing these ESG aspects can also help monitor the progress of the seventeen Sustainable Development Goals (SDGs) included in the United Nations' 2030 Agenda for Sustainable Development [1], which sets ambitious goals for building a sustainable society such as gender equality, responsible consumption and production, and climate action.
Over the last decade, there has been a growing demand for disclosing companies' non-financial information. This demand comes from legislation such as the European Union's Non-Financial Reporting Directive (NFRD, [2]), which requires all public-interest companies with more than 500 employees to do so. The more recent European Union's Corporate Sustainability Reporting Directive (CSRD, [3]) further increases this demand by enlarging the pool of companies concerned by a factor of 4: from roughly 12 thousand to 50 thousand companies [3].
Non-financial disclosures are typically reported in sustainability reports, web pages, social media posts, news, press releases or earnings calls. To overcome this variety of sources, stakeholders generally rely on a third-party assessment of corporations' ESG performance to inform their decisions: ESG ratings provided by agencies such as Sustainalytics, MSCI, S&P Global, Moody's and Refinitiv [4]. These rating agencies rely on proprietary assessment methodologies with different perspectives on the measurement, scope and weight of different ESG aspects. This creates divergences in companies' evaluations across agencies and thus unsatisfactory degrees of explainability, transparency or fairness [5][6][7][8]. Stakeholders might overcome this issue by directly accessing non-financial information and imposing their own scope and weight to assess corporations' ESG performance [4,9]. However, extracting meaningful insights from ESG-related data sources, which often include lengthy documents, can be challenging and laborious. Consequently, stakeholders can face significant obstacles in evaluating companies' ESG performance, ranging from opaque and divergent assessments to verbose textual documents. We posit that a data-driven approach, coupled with state-of-the-art Natural Language Processing (NLP) techniques, can provide automatic tools to extract insights from companies' disclosures such as sustainability reports. Further, the proposed methodology allows us to better understand the companies' ESG assessments and to unveil relationships between what companies disclose and their ratings.
The purpose of this work is to automatically extract and analyse the ESG initiatives disclosed by companies in their sustainability reports, and to investigate how these impact their ESG performance assessment. Table 1 presents the definitions of the ESG-related terms adopted henceforth. Our proposed methodology relies on Large Language Models (LLMs, [10][11][12]) for Information Extraction purposes, on graph-based representations for data analysis, and on the SHapley Additive exPlanations (SHAP, [13]) framework for the interpretability of the ESG scores. Specifically, we employed a generative LLM to extract structured insights from companies' sustainability reports, bipartite graphs to conduct analyses on them, and the SHAP framework on linear regression to investigate the impact of companies' disclosures on ESG scores. Turning the unstructured information from sustainability reports into a structured and unified format allows us to build graph-based representations; these, in turn, are directly usable and exploitable for a diverse set of tasks, from exploring and navigating the data to discovering emerging patterns via statistical and interpretability analyses.
As mentioned before, for this first Information Extraction step, we employ an instruction-finetuned generative LLM (WizardLM, [16]), leveraging both In-Context Learning (ICL, [17]) and Retrieval Augmented Generation (RAG, [18]), to extract ESG-related actions as triples from companies' sustainability reports. LLMs have consistently been shown to hold semantic understanding and store factual knowledge [18][19][20][21]; instruction-tuning (that is, providing task descriptions via natural language instructions) further enhances their capabilities to address downstream NLP tasks, such as Information Extraction [10,16,22,23,24]; LLMs have also demonstrated a remarkable ability as few-shot learners using In-Context Learning [17], a technique that relies on providing a few input-output samples within the model instruction.
The triple format allows us to represent the document sentences in a standardised format following a pre-defined semantic template tailored to ESG aspects. These extracted triples are then used to build a Knowledge Graph (KG), following the same research direction of recent studies based on OpenAI's commercial LLMs [25][26][27][28]. This allows us to condense all the companies' disclosures through a graph representation, which offers a versatile approach to illustrate structured information [29] as concepts (nodes) and their relationships (edges) [30,31]. The generated KG is then decomposed into several bipartite graphs (graphs whose nodes split into two disjoint sets) to analyse the companies' disclosures by inspecting the extracted information from various angles, including the company and topic perspective.
Our findings include descriptive, similarity and correlation analyses concerning ESG-related actions disclosed by companies in their sustainability reports. These analyses unveiled that companies addressed ESG topics from several perspectives, spotlighting the complexity of ESG-related matters. In addition, similarities in disclosures emerged among companies from the same geographical region and the same sector, pointing to the possible influence of exogenous factors [32,33] and the presence of sector-focused topics [34][35][36]. Further, our interpretability analysis of ESG scores highlights how a company's disclosures impact its ESG rating more than other company-specific or financial aspects. Finally, our analyses show the rewards of transparency: comprehensive disclosure of non-financial information appears to affect ESG scores positively, whereas reporting on a limited set of ESG categories seems detrimental; interestingly, we also observe a significant incidence of other factors not directly related to ESG, such as a company's incorporation year and continent (Europe).
Overall, our work contributes to the literature on sustainability and Corporate Social Responsibility (CSR) by proposing an advanced NLP pipeline for automatically extracting information from companies' sustainability reports and validating some ongoing hypotheses using a data-driven approach and the company perspective.

Related work
In this section, we first discuss the state-of-the-art approaches for creating Knowledge Graphs (KGs, Sect. 2.1), encompassing a spectrum ranging from conventional NLP pipelines to the exploitation of Large Language Models (LLMs). In Sect. 2.2, we then summarise the main studies focused on ESG-related textual information. Lastly, we present ongoing discussions within the sustainable finance literature in Sect. 2.3.

Knowledge graphs generation
Knowledge Graphs (KGs, [30,31]) offer a versatile method of representing knowledge that can be leveraged in various use cases and domains [37,38]. They can be applied to question-answering [39], recommendation systems [40], and information retrieval [29]. The task of KG generation, also known as knowledge acquisition, aims to create KGs by extracting information from unstructured, semi-structured or structured sources as well as augmenting existing graphs [38,41,42]. Traditional approaches for knowledge acquisition involve several NLP tasks which are generally disjointly learned, a process that is prone to error accumulation. To overcome this problem, new one-stage NLP pipelines have been proposed to jointly extract both entities and relations [43,44].
In this context, Open Information Extraction (OIE, [45]) emerged as the task of extracting structured information, typically in the form of subject-predicate-object (SPO) triples, without relying on a predefined template or a specific domain. These SPO triples can then be leveraged to generate knowledge graphs based on the subjects, predicates and objects extracted from textual documents. This approach can mitigate the impact of depending on external knowledge, such as patterns and domain-specific heuristic rules present in the training data. Recently, OIE models (e.g., Multi²OIE [46] and DeepEx [47]) have employed transformer-based LLMs (e.g., BERT [48]) to extract both syntactic and semantic information [49].
We follow this promising research direction by exploiting the semantic understanding, generative abilities and flexibility of LLMs to extract ESG-related structured information.Our methodology adopts a Large Language Model, alongside the Retrieval Augmented Generation (RAG) paradigm [18] and the In-Context Learning (ICL) technique, to overcome the main limitation of conventional OIE methods in achieving our goal of extracting ESG-related structured insights: OIE methods traditionally extract structured information by relying only on the syntactical structure of a sentence, without any predefined template, which poses important limitations in the type of information extracted.
They may fail to extract, for example, information related to specific domains or activities that are not the direct subject of the sentence. An information extraction approach based on generative LLMs allows us, instead, to generate semantically-aware and ESG-oriented triples rather than traditional SPO triples, enabling the creation of a full-fledged ESG-oriented KG.

Text analytics on ESG-related information
Several works have explored the use of NLP technology to process companies' non-financial textual information and extract meaningful insights concerning statements, facts and actions disclosed by companies. For example, Chou et al. [53] applied distributional semantics on 10-K filings (i.e., annual financial reports required by the U.S. Securities and Exchange Commission) for extracting topics related to climate change that are disclosed by companies, aiming to monitor environmental policy compliance. Raghupathi et al. [54] applied traditional approaches ranging from trigram co-occurrences to clustering and classification to gain insights into the shareholders' perspectives and objectives regarding sustainability and climate change. LLMs have also been leveraged for Text Analytics in the ESG domain. Jacouton et al. [55,56] introduced SDG Prospector, a tool exploiting LLMs to identify paragraphs in Public Development Banks' sustainability reports that address SDGs. Similarly, Webersinke et al. [57] introduced ClimateBert, a transformer-based language model fine-tuned for climate-related classification tasks. Vaghefi et al. [58] adopted the paradigm of RAG [18] to enrich GPT-4 with the ability to reliably answer climate-related questions, by augmenting the posed questions with contextual information retrieved from the Sixth Assessment Report, released by the United Nations Intergovernmental Panel on Climate Change (IPCC). The authors also released a conversational agent [59] based on their proposed approach. Ni et al. [60] developed ChatReport, an LLM-based tool that evaluates companies' sustainability reports according to the eleven recommendations [61] provided by the Task Force for Climate-Related Financial Disclosures (TCFD). The tool combines semantic search to identify text chunks that are pertinent to each recommendation and LLM prompting to summarize them.
Our work differs from existing approaches by jointly (i) leveraging generative LLMs for Knowledge Graph (KG) generation, (ii) employing an open-sourced generative LLM and (iii) relying directly on companies' sustainability reports.This methodology, alongside the exploitation of bipartite graph representation, allows us to conduct non-trivial analyses concerning ESG categories and actions disclosed by companies.

Sustainable finance
Companies' non-financial disclosures might be influenced by the diverse regulatory requirements specific to different regions [62], as well as by factors such as the political, labour, and cultural landscape of countries they operate in [32,33,63].
European, American and Asian companies could address their Corporate Social Responsibility (CSR) by prioritising different socially responsible efforts, investments, and disclosures based on the demands coming from their region. For instance, Baldini et al. [33] found that a country's labour union density positively impacts social and governance disclosure, whereas Yu et al. [32] unveiled that the lack of political rights has a negative influence on ESG disclosure. On the other hand, CSR might also be influenced by regulatory agencies [62]. For example, European companies might prioritise environmental-related issues due to more stringent climate regulations in Europe, such as the European Union's Emissions Trading System [64] or the ambitious "Fit for 55" plan [65] recently proposed by the European Union to achieve climate-neutrality by 2050.
This greater European commitment towards environmental aspects might also cause biases in benchmarking companies' environmental performance. For example, the study of LaBella et al. [66] unveiled the important role of ESG rating agencies and their biases towards European companies. The authors [66] discovered a bias among rating agencies when evaluating ESG performance, showing a preference for European companies over their North American, Emerging Markets, and Developed Asian counterparts. They proposed that this bias could stem from variations in formal reporting requirements across different jurisdictions, contributing to differences in the quality of companies' non-financial disclosures [66]. This indirectly spotlights the pioneering and benchmark role of European companies concerning non-financial disclosures and CSR. Additionally, other studies [63,66,67] suggest that non-financial disclosures, and indirectly companies' ESG assessments, might also suffer from biases based on company size. This is because generating non-financial disclosures can be both financially and labour-intensive [66]. As a result, larger companies could invest more economic and human resources in improving their non-financial transparency, positively influencing their ESG evaluation.
On the other hand, a company's ESG performance is traditionally assessed relative to industry peers due to industry-specific ESG concerns. This allows rating agencies (e.g., Refinitiv) to assess a company's ESG performance by giving greater weight to the ESG topics relevant to the company's industry. For instance, packaging could be more relevant for companies producing consumer staples (e.g., the beverage industry), while materials companies (e.g., the chemicals industry) might be affected by physical-related topics such as Employee Safety. This factor has already emerged in the ESG literature [34][35][36] and is named ESG Industry Materiality. In the ESG-related debate [68,69], materiality generally identifies two types of sustainability issues in an industry: financial materiality refers to issues that must be disclosed due to their potential significant effects on financial performance, and impact materiality pertains to information on the impact of a company's activities on the surrounding ecosystems and social systems. Reporting standard organizations, such as the Sustainability Accounting Standards Board (SASB), in turn identify environmentally, socially or financially relevant issues for a specific industry (known as a materiality matrix) that necessitate disclosure, such as Water Management within the Non-Alcoholic Beverages industry. This aids companies in enhancing their disclosure of pertinent subjects within their industry. Our data-driven methodology enriches these ongoing discussions in the literature by providing structured insights on ESG topics directly extracted from companies' non-financial textual disclosures (Sect. 4). This facilitates a deeper examination of companies' ESG initiatives compared to the prevalent dependence on proprietary assessments and tools in current literature. In the discussion section (Sect. 5), we leverage these data-driven insights to extend existing debates in the literature concerning, for example, the diversity of actions taken by companies, as well as correlations across sectors and geographic regions.

Materials and methods
This section discusses the data, the approaches, and the methods used in this work. First, we describe the data sources (Sect. 3.1); second, we provide a detailed overview of our approach, from data preparation (Sect. 3.2) to triple generation (Sect. 3.3) and KG generation (Sect. 3.4). Finally, we discuss the methods and approaches used to analyse, compare and evaluate the findings concerning the generated triples (Sect. 3.5).

Sustainability reports
Sustainability reports are non-financial documents published by companies to disclose information concerning the impact of their activities on the environment and people. They describe the actions the company took or expects to take regarding ESG matters, such as respect for human rights, fair treatment of employees, anti-corruption and bribery as well as board diversity [3]. However, ESG reporting can be subjective and opaque due to the complexity of reporting qualitative aspects, particularly those related to social and governance issues [70]. Furthermore, the lack of a standardised framework for ESG reporting makes quantitative/comparative analyses difficult [71,72].
We initially collected 6456 sustainability reports from 4222 different companies using the report URLs available on two public websites [73,74]. Although these sustainability reports are mainly written in English (94% of all the available documents), the nationality of the companies is fairly diversified, covering 74 different countries. However, the majority of the available reports (56%) come from North American companies. For our study, we consider only the reports written in English because of its broad coverage and the wide range of pre-trained language models available for this language (many thousands of language models on the Hugging Face platform [75]).
Concerning the period covered, we gather reports up to fiscal year 2022 (9.6% of the available documents), even though the majority of the documents (54%) refer to the fiscal year 2021 (a fiscal year is a twelve-month period that generally coincides with a calendar year). This temporal distribution is expected, since we gathered these sustainability reports in February 2023 and the majority of the reports published throughout 2022 disclose information concerning the previous fiscal year (2021).
Nevertheless, processing more than six thousand sustainability reports from four thousand diverse companies poses a significant computational challenge, especially when using resource-intensive generative LLMs that substantially increase the computational budget required. To showcase our pipeline and present meaningful insights, we consequently select a representative subset of companies by balancing sector, region, company notoriety, size and age representation. For the notoriety criterion, we subjectively consider well-known companies with high capitalization (e.g., NVIDIA), low ESG-related performance (e.g., Saudi Aramco) and recent controversies (e.g., First Republic Bank). This subset comprises 124 companies spanning 3 continents, 15 countries and 11 GICS (Global Industry Classification Standard) sectors (Table 2). The sample covers companies established
For this study, we only considered the most recent sustainability report for each company in the selected subset, so as to avoid any skewness effect. Note that the proposed methodology can be used as-is for longitudinal studies by simply using sustainability reports covering several years.
Further details concerning the distribution of the fiscal years considered for this work, the complete list of the considered companies and the original dataset can be found in the Supplementary Material (SM) document (see Additional file 1, Sects. SM1.2, SM1.3 and SM1.4).

ESG categorization
We adopt the ESG categories proposed by Berg et al. in [8]. The authors grouped, using a bottom-up methodology, nearly seven hundred ESG-related indicators from six distinct ESG data providers (Sustainalytics, S&P Global, Refinitiv, Moody's ESG, Morgan Stanley Capital International-MSCI, and MSCI-KLD) into a unified categorization comprising sixty-four ESG categories. These categories encompass environmental, social and corporate governance issues including Employee Development, Supply Chain, Climate Risk Management, Energy, Financial Inclusion, Biodiversity, Customer Relationship, Access to Basic Services, and Board Diversity. A complete list of all the ESG categories can be found in the Supplementary Material file (see Sect. SM1.1).

ESG ratings and other company information
Rating agencies such as Refinitiv, MSCI, and Sustainalytics utilise non-financial reports and ESG-related information to systematically assess the impact of companies' activities on the environment and society. This assessment, typically done through numerical scores, offers stakeholders valuable measures for evaluating and comparing companies in terms of their performance related to ESG topics.
Among those, we used the Refinitiv platform, which provides high-quality financial and ESG data. The Refinitiv ESG ratings are given as percentage scores [76], wherein lower values (0-25) indicate poor ESG performance and a lack of transparency in publicly reporting material on ESG data (i.e., laggard companies); conversely, higher values (75-100) indicate excellent ESG performance and a high level of transparency in publicly reporting material on ESG data (i.e., leader companies). In addition, it is worth mentioning that a zero score is assigned in the rare case that a company does not disclose any metrics or information relevant to its industry [9,77]. The combined ESG score is a cumulative measure of E/S/G pillars' weights, which differ by industry for the first two pillars (E and S), while the weight of the third pillar (G) remains consistent across all industries [76]. A company's ESG performance is indeed assessed relative to industry peers due to industry-specific ESG concerns (Sect. 2.3).
For each company, we collected twelve company features encompassing ESG scores, unchanging company details and annual financial data (regarding the same fiscal year of the considered sustainability reports). Specifically, we gathered: the combined ESG scores, individual scores for the E/S/G pillars, company sector, industry, country, region and continent, number of employees, market capitalization, EBITDA (Earnings Before Interest, Taxes, Depreciation, and Amortization), and total liabilities.
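To make the score description concrete, the following is a minimal sketch of a weighted pillar aggregation and of the two score bands mentioned above. It assumes that the combined score is a normalised weighted average of the three pillar scores; the weights, the function names and the "intermediate" label for scores between the two bands are illustrative assumptions, not Refinitiv's actual values or terminology.

```python
def combined_esg_score(pillar_scores, weights):
    """Weighted average of the E/S/G pillar scores (assumed aggregation rule)."""
    total_weight = sum(weights[p] for p in ("E", "S", "G"))
    weighted_sum = sum(pillar_scores[p] * weights[p] for p in ("E", "S", "G"))
    return weighted_sum / total_weight


def score_band(score):
    """Map a percentage score to the bands named in the text."""
    if score <= 25:
        return "laggard"
    if score >= 75:
        return "leader"
    return "intermediate"  # hypothetical label: the text names only the two extremes


# Hypothetical industry weights: E and S vary by industry, G is held fixed.
weights = {"E": 0.5, "S": 0.25, "G": 0.25}
scores = {"E": 80.0, "S": 60.0, "G": 40.0}
combined = combined_esg_score(scores, weights)
print(combined, score_band(combined))
```

Changing the E and S weights while keeping G fixed mimics how the same pillar scores can lead to different combined scores across industries.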

Data preparation
Our NLP pipeline consists of several components (Fig. 1), including some pre-processing methods, to extract structured insights from sustainability reports.
In this section, we describe the NLP methods adopted to prepare the data for our subsequent analyses. These pre-processing methods include extracting text from PDF files and segmenting sentences (Sect. 3.2.1) as well as using semantic search to select only sentences related to ESG topics (Sect. 3.2.2). The latter relies on the ESG categorization introduced in Sect. 3.1.2.

NLP pipeline
Sustainability reports are generally visually rich and lengthy PDF documents; for example, the reports in our sample have a median length of 61 pages. Unfortunately, companies' incentive to present visually appealing infographics and tables degrades the output quality of standard text extraction tools.
Figure 1 Our proposed approach and its components. Given a collection of textual documents as inputs and preprocessed using different NLP methods, semantically structured insights are extracted by the retrieval-augmented triple generator. Bipartite graphs are created after performing a semantic-based triple clustering
Hence, for the textual part, we rely on a PDF parser (PyMuPDF, [78]) and apply standard preprocessing steps to improve the quality of the extracted text and reduce the artefacts generated by the parser. Specifically, regular expressions are used to add a full stop between two sentences when missing, remove new lines in the middle of sentences and remove duplicate white spaces and lines.
Processing textual data also requires defining the granularity of the input data according to purpose, needs and limitations. A text corpus contains textual items representing singular tokens, words, sentences, paragraphs, or entire documents. We adopt sentence-level textual inputs as a good trade-off between the semantic meaning conveyed by a sentence and technical limitations (e.g., the maximum prompt length of the language model). Consequently, after extracting textual data from the sustainability reports, we decompose each report into sentences with PySBD [79,80], a tool widely considered the state of the art in Sentence Boundary Disambiguation (SBD).
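The regular-expression clean-up of the parser output can be sketched as follows. This is a minimal illustration using only Python's standard `re` module; the function name and the specific patterns are assumptions (in particular, a missing full stop is detected here only at a paragraph break), not the exact rules used in our pipeline.

```python
import re


def clean_extracted_text(raw: str) -> str:
    """Hypothetical clean-up of text returned by a PDF parser."""
    # Remove duplicate non-empty lines, keeping the first occurrence.
    seen, kept = set(), []
    for line in raw.splitlines():
        stripped = line.strip()
        if stripped and stripped in seen:
            continue
        seen.add(stripped)
        kept.append(stripped)
    text = "\n".join(kept)
    # Add a full stop when a block ends without terminal punctuation.
    text = re.sub(r"([^\s.!?:;])\n{2,}", r"\1.\n\n", text)
    # Join lines that were broken in the middle of a sentence.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse paragraph breaks and repeated white spaces.
    text = re.sub(r"\n{2,}", " ", text)
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()


raw = "Our emissions fell\nby 10% in 2021\n\nWe recycle   more waste.\nOur emissions fell\n"
print(clean_extracted_text(raw))
```

The cleaned string can then be passed to a sentence segmenter; dropping every repeated line is a deliberately aggressive choice that suits repeated headers and footers but may need relaxing for other layouts.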

Asymmetric semantic search
Sustainability reports include generic and vague statements, for example, phrases such as "Air is something that surrounds us 24 hours a day". Accordingly, a filtering process is required to consider only ESG-related sentences for downstream tasks. The two most well-known filtering methods for information retrieval are keyword-based and vector-based search [81]. We adopt the approach of neural semantic search [82][83][84][85], a vector-based search method that exploits text embeddings to represent both documents and queries in the same vector space. Relying on text embeddings allows us to move towards a semantic-oriented filtering approach, reducing the dependency on single keywords, and thus on the ESG categorization adopted. This representation allows one to measure the semantic similarity between a document and a query by simply computing the distance (or, conversely, the similarity) between their corresponding embedded vectors [82].
Our retrieval task involves discovering sentences related to each of the ESG categories (Sect. 3.1.2) within each sentence-segmented sustainability report. This implies working in a setting of asymmetric semantic search, in which queries (ESG categories) and corpus documents (company sentences) are not interchangeable as they represent different semantic object types and have different lengths, similarly to the question answering framework [83,84]. In contrast, symmetric semantic search is adopted when queries and corpus documents are interchangeable, such as in similar document retrieval systems [86][87][88]. After extensive testing, we determine that "INSTRUCTOR-xl" [89], an instruction-tuned embedding model [90], is the most suitable choice for our specific tasks. The authors of the model [90] offer a universal instruction template ("Represent the [domain] [text type] for [task objective]:") along with an example list. Since we deal with asymmetric semantic search, the instructions provided to the model vary depending on the type of input. For embedding the ESG categories (queries), we use the instruction "Represent the title for retrieving relevant statements". To embed the sentences (corpus documents), the instruction employed is "Represent the statement for retrieval".
After generating the text embeddings for a sentence-segmented sustainability report, we retrieve the most semantically relevant sentences for each ESG topic (semantic search, Table 3). We set the similarity cut-off threshold t_sim equal to 0.6 to retrieve relevant sentences to a given ESG topic. Empirical experiments show an acceptable sentence relevance with a similarity threshold equal to or above 0.4. Thus, we adopt a cut-off point of 0.6 as a good trade-off between sentence coverage and computational workload. In addition, we consider the top k sentences (with k = 30) to limit the number of retrieved sentences following the aforementioned trade-off. This filtering layer also helps us reduce the computational time of the follow-up steps (e.g., the inference of the generative language model) and prune the resulting KG.
[Table 3, example of a retrieved sentence: "Through continuous publicity and education on garbage classification, the project wastes were gradually reduced, recycled and harmless . . ."]
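The asymmetric setting amounts to pairing each text with the instruction for its role before embedding. The sketch below only builds those pairs; the helper name is hypothetical, and the `[instruction, text]` pair layout is, to our understanding, the input format expected by the INSTRUCTOR package, which should be checked against its documentation before use.

```python
# Instructions quoted from the text: one for queries, one for corpus documents.
QUERY_INSTRUCTION = "Represent the title for retrieving relevant statements"
CORPUS_INSTRUCTION = "Represent the statement for retrieval"


def with_instruction(instruction, texts):
    """Pair every text with the instruction matching its role (hypothetical helper)."""
    return [[instruction, text] for text in texts]


# ESG categories act as queries; report sentences act as corpus documents.
queries = with_instruction(QUERY_INSTRUCTION, ["Waste", "Board Diversity"])
corpus = with_instruction(
    CORPUS_INSTRUCTION,
    ["Microsoft has invested 125 million in cutting-edge recycling technologies"],
)
print(queries[0], corpus[0])
```

Keeping the two instructions separate is what makes the search asymmetric: the same string would be embedded differently depending on whether it plays the role of a category or of a sentence.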
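The filtering described in this subsection, a similarity cut-off followed by a top-k limit, can be sketched in plain Python. Cosine similarity is assumed as the similarity measure, and the toy two-dimensional vectors stand in for real sentence embeddings; the function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, sentence_vecs, t_sim=0.6, k=30):
    """Return (sentence index, similarity) pairs above t_sim, best first, at most k."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(sentence_vecs)]
    hits = [(s, i) for s, i in scored if s >= t_sim]
    hits.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, s) for s, i in hits[:k]]


# Toy embeddings: one ESG-category query and four candidate sentences.
query = [1.0, 0.0]
sentences = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [4.0, 3.0]]
print(retrieve(query, sentences))
```

Applying the threshold before the top-k cut means a category with few relevant sentences returns fewer than k results rather than padding the list with weak matches.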

Retrieval-augmented triple generation
Our work aims to create a KG connecting companies, ESG topics, and actions disclosed by companies related to those topics. To achieve this goal, we need to represent ESG-related sentences in a unified and standardised format: triples with a predefined semantic template. Precisely, each ESG-oriented triple (cat-pred-obj) should consist of an ESG category (cat) representing an ESG topic mentioned directly or indirectly in the sentence, a predicate affecting that category (pred), and an entity (obj) related to the ESG category undergoing the predicate. We define an action (act) as the concatenation of the ESG category (cat) and the predicate (pred) of a cat-pred-obj triple.
Consequently, our goal requires knowing the semantic meaning of words as well as defining a semantic template to generate ESG-oriented triples. The latest OIE techniques (Sect. 2.1) incorporate semantic information for extracting structured information, yet they rely only on the syntactical structure of the sentence. For example, given the sentence "Microsoft has invested 125 million in cutting-edge recycling technologies", conventional OIE techniques [91] would identify and generate a traditional SPO triple such as (Microsoft, Invested, 125 million). Although this SPO triple represents the conveyed semantic meaning well, it would not be suitable for our goal. Indeed, the ideal triple would be (Waste, Investment in, Cutting-edge recycling technologies).
Generating the latter requires defining a semantically-aware triple template. Firstly, the entity Waste, representing an ESG category, is not explicitly mentioned, although it can be inferred from the term recycling technologies. This type of inference jointly involves information extraction and semantic classification tasks. Secondly, our goal requires extracting ESG-related actions rather than generic statements. Hence, triples should contain predicates and objects related to an ESG category. For instance, given that ideal triple, the action (act) is defined as the ESG category Waste concatenated with the predicate Investment in, resulting in the action "Waste:Investment in".
LLMs have already demonstrated abilities in semantic understanding and in handling a broad range of NLP-related tasks [10,22]. Accordingly, in this work, we employ instruction-tuned LLMs, the In-Context Learning technique and the prominent RAG paradigm [18] to address this challenge. Our work exploits these techniques to provide an LLM with an input (ESG-related sentence) and an external context (input-output examples and a semantic schema) to extract structured information from the sentence.
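As a small illustration of the triple template and of how an action is derived from it, consider the sketch below. The class name `ESGTriple` is an assumption introduced here; only the fields cat, pred, obj and the "cat:pred" concatenation come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ESGTriple:
    """An ESG-oriented triple: category, predicate and object."""
    cat: str   # ESG category, possibly only implied by the sentence
    pred: str  # predicate affecting the category
    obj: str   # entity related to the category undergoing the predicate

    @property
    def act(self) -> str:
        """Action: the category concatenated with the predicate."""
        return f"{self.cat}:{self.pred}"

# The ideal triple for the example sentence about recycling technologies.
triple = ESGTriple("Waste", "Investment in", "Cutting-edge recycling technologies")
```

Here `triple.act` evaluates to "Waste:Investment in", the action string used in the text.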
We choose the Kor library [92] to create in-context instructions for LLMs. Kor allows prompts to be constructed programmatically by specifying the semantic data schema for the ideal triples (cat-pred-obj) and by including labelled examples. A labelled example connects an input sentence with the desired output, an ESG-oriented triple (Fig. 2). In the model instruction, each element of the triple (cat, pred and obj) is declared with a unique name and a natural language description conveying its semantic meaning. The Kor library then uses this information to generate, by means of a predefined template (Fig. 3), a textual instruction to prompt an instruction-tuned LLM. We integrated the sixty-four ESG categories of the ESG categorization (Sect. 3.1.2) in the description of the attribute cat. The LLM leverages this list of ESG topics as semantic guidance when generating the ESG category for a given input sentence. It achieves this by mimicking the results of a supervised text classifier, whose labels are those of the ESG categorization, and extrapolating those labels with semantic generalization. Sect. SM2.1 of the Supplementary Material (SM) document exhibits the full model instruction.
We tested different instruction-tuned LLMs such as Google's Flan-T5 [93] and LLaMA-based models [94]. We empirically found that LLaMA-based models (e.g., Alpaca [95]) generate better results when prompted to extract structured information, with WizardLM-7B [16,96] producing the highest-quality results. Appendix C and Appendix D showcase additional information regarding empirical experiments conducted on different LLMs, as well as the selection of specific hyperparameters for the LLM generation process.
Figure 3 The instruction template used to prompt the generative LLM. The instruction template is created and compiled by the Kor library to prompt an LLM to extract structured data using In-Context Learning. DATASCHEMA is a placeholder for the output data schema, EXAMPLES is a placeholder for the labelled examples in the format of input-output pairs, and INPUT represents an ESG-related sentence from which structured data needs to be retrieved
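To make the idea of a schema-plus-examples instruction concrete, the sketch below assembles such a prompt in plain Python. It deliberately does not use the Kor API, and it does not reproduce the authors' template: the function `build_prompt`, the wording of the instruction, the attribute descriptions, the short category list and the input sentence are all hypothetical.

```python
def build_prompt(schema, examples, sentence):
    """Assemble an in-context instruction from a schema, examples and an input."""
    schema_lines = [f"- {name}: {desc}" for name, desc in schema.items()]
    example_lines = [
        f"Input: {text}\nOutput: {cat} | {pred} | {obj}"
        for text, (cat, pred, obj) in examples
    ]
    return "\n".join([
        "Extract a structured triple from the input sentence.",
        "Schema:", *schema_lines,
        "Examples:", *example_lines,
        f"Input: {sentence}",
        "Output:",
    ])

# Hypothetical attribute descriptions; the real cat description would list
# all sixty-four categories of the ESG categorization.
schema = {
    "cat": "ESG category, e.g. Waste, Energy, Supply Chain",
    "pred": "predicate affecting the ESG category",
    "obj": "entity related to the category undergoing the predicate",
}
examples = [(
    "Microsoft has invested 125 million in cutting-edge recycling technologies",
    ("Waste", "Investment in", "Cutting-edge recycling technologies"),
)]
prompt = build_prompt(schema, examples, "We reduced electricity use in our offices.")
```

Listing the allowed categories inside the description of cat is what lets the model behave like a classifier over those labels while still generalising to unseen ones.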

Knowledge graph generation
Before constructing a Knowledge Graph (KG) using the generated cat-pred-obj triples, we apply a data-cleaning process to reduce data redundancy. The redundancy in the KG comes from nodes and edges representing similar concepts and their relationships multiple times.
To achieve this goal, we perform semantic clustering on all the ESG categories (cat) and predicates (pred) included in the generated triples. Firstly, we generate text embeddings using the "INSTRUCTOR-xl" embedding model [90] with the model instruction "Represent the title". Secondly, semantic clusters are discovered as high-density regions in the embedded vector space using cosine similarity as a metric. We conducted several empirical experiments to evaluate the cluster goodness using different similarity cut-off thresholds, ranging from 0.5 to 0.9. Eventually, we adopt a similarity cut-off point of 0.8, as it strikes a good balance between the semantic coherence of the cluster elements (cluster quality) and the cluster sizes.
Finally, we label each cluster with its centroid and use cluster labels to replace the original ESG categories and predicates of the original triples. For instance, the predicate cluster labelled Partnership with groups 103 different predicates encompassing Working together with, Partnering with others to and Collaborating of. The Supplementary Material file reports further examples of these clustering operations (Sect. SM2.2). Figure 4 exhibits, for explanatory purposes, a portion of the KG generated using our methodology.
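The clustering and relabelling step can be illustrated with a deliberately simplified stand-in. The sketch below uses a greedy, seed-based grouping with a cosine threshold of 0.8, which is not the density-based procedure described above; the function names and the toy two-dimensional vectors are assumptions, and "centroid" is interpreted here as the member closest to the mean vector.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_clusters(vectors, threshold=0.8):
    """Assign each item to the first cluster whose seed is similar enough."""
    clusters = []  # lists of item indices; the first index is the seed
    for i, v in enumerate(vectors):
        for members in clusters:
            if cosine(v, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def centroid_label(members, vectors, labels):
    """Label a cluster with the member closest to the cluster mean."""
    dim = len(vectors[0])
    mean = [sum(vectors[i][d] for i in members) / len(members) for d in range(dim)]
    return labels[max(members, key=lambda i: cosine(vectors[i], mean))]

# Toy vectors standing in for predicate embeddings (hypothetical data).
labels = ["Partnership with", "Working together with", "Collaborating of", "Reduction of"]
vectors = [[0.95, 0.1], [1.0, 0.0], [0.9, 0.2], [0.0, 1.0]]
clusters = greedy_clusters(vectors)
mapping = {labels[i]: centroid_label(c, vectors, labels) for c in clusters for i in c}
```

The resulting `mapping` is what would be applied to the cat and pred fields of every triple before building the graph.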

Approaches for statistical analyses
In the Results section (Sect. 4), we mostly deal with undirected bipartite graphs obtained from the original KG. A bipartite graph is a graph whose vertices can be divided into two distinct and independent sets or partitions [14,97,98]. It can be described through its binary bi-adjacency matrix B, a $\{0, 1\}^{n \times m}$ matrix where n and m are the numbers of nodes in the two partitions. A bipartite graph can consequently be seen as a special type of knowledge graph whose nodes can be divided into two distinct and independent partitions. Its graph edges are accordingly context-dependent and change based on the perspective used to generate the bipartite graph.

Figure 4 An example of a portion of the Knowledge Graph generated using our methodology. It portrays the ESG-oriented triples generated using our approach. Blue nodes represent company nodes which are connected to the ESG categories (green nodes) disclosed in the companies' sustainability reports. Category nodes (cat) are connected via a labelled edge (pred) to the predicate object (obj, grey nodes)

Specifically, we create three distinct bipartite graphs for the analyses of our findings through node and edge filtering of the comprehensive KG, that is, by isolating distinct types of nodes, and their relative connections, from the original graph. The creation of different two-fold representations (bipartite graphs) helps conduct downstream analyses on specific relationships among the different types of nodes included in the comprehensive knowledge graph. This allows us to analyse the extracted insights from three distinct perspectives:
1. the predicates (pred) disclosed with each ESG category (cat), analysed using the category-predicate bipartite graph $B_{catpred}$;
2. the ESG categories (cat) disclosed by each company, analysed using the company-category bipartite graph $B_{cocat}$;
3. the actions (act) disclosed by each company, analysed using the company-action bipartite graph $B_{coact}$.
A table encompassing the number of partition nodes, the number of edges and the density for each bipartite graph is provided in the Supplementary Material (see Sect. SM3.2.1).
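The three perspectives can be sketched as projections of the same set of records. The example below is an assumption-laden illustration: the records, the placeholder company names and the helper names `bipartite` and `density` are hypothetical, and each graph is stored as a mapping from a left-partition node to its set of right-partition neighbours.

```python
from collections import defaultdict

# Hypothetical (company, cat, pred, obj) records.
records = [
    ("Company A", "Waste", "Reduction of", "Plastic packaging"),
    ("Company A", "Energy", "Reduction of", "Electricity use"),
    ("Company B", "Energy", "Reduction of", "Coal generation"),
    ("Company B", "Energy", "Investment in", "Renewables"),
]

def bipartite(records, left, right):
    """Map each left-partition node to the set of right-partition neighbours."""
    graph = defaultdict(set)
    for rec in records:
        graph[left(rec)].add(right(rec))
    return dict(graph)

def density(graph):
    """Number of edges divided by the number of possible left-right pairs."""
    right_nodes = set().union(*graph.values())
    edges = sum(len(neigh) for neigh in graph.values())
    return edges / (len(graph) * len(right_nodes))

b_catpred = bipartite(records, lambda r: r[1], lambda r: r[2])
b_cocat = bipartite(records, lambda r: r[0], lambda r: r[1])
b_coact = bipartite(records, lambda r: r[0], lambda r: f"{r[1]}:{r[2]}")
```

Because sets are used for the neighbours, repeated disclosures of the same category or action by one company collapse into a single edge, which matches a binary bi-adjacency matrix.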

Bipartite graph statistics
Most of the unipartite graph metrics can be extended to the bipartite case [14,98]. Specifically, here we compute the bipartite variants of network statistics such as degree centrality, closeness centrality and betweenness centrality [98,99].
The degree centrality of a partition node is the fraction of the nodes of the other partition connected to it [100]. The closeness centrality of a node [97,98] is determined by calculating its average shortest path distance to all other nodes. It represents the efficiency of a node to be connected directly to nodes from the other partition and indirectly to nodes from the same partition [101]. For instance, given $B_{cocat}$, a company node with a high closeness score indicates that the company is connected, and thus close, to many category nodes which in turn are connected to several other company nodes. Lastly, betweenness centrality [97,102] assesses the level of influence a node holds over information flow within a graph. In the context of bipartite graphs, it identifies nodes serving as critical mediators in enabling interactions between the two separate node partitions [14].
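Of the three statistics, bipartite degree centrality is the simplest to illustrate. The sketch below follows the definition given above (the fraction of nodes of the other partition connected to a node); the function name and the toy company-category graph are hypothetical.

```python
def degree_centrality(graph):
    """Bipartite degree centrality for both partitions.

    `graph` maps each left node to the set of right nodes it is linked to.
    Each score is the node degree divided by the size of the other partition.
    """
    right_nodes = set().union(*graph.values())
    left_scores = {u: len(neigh) / len(right_nodes) for u, neigh in graph.items()}
    right_scores = {
        v: sum(v in neigh for neigh in graph.values()) / len(graph)
        for v in right_nodes
    }
    return left_scores, right_scores

# Hypothetical company-category graph.
b_cocat = {
    "Company A": {"Waste", "Energy"},
    "Company B": {"Energy"},
    "Company C": {"Energy", "Water"},
}
company_scores, category_scores = degree_centrality(b_cocat)
```

In this toy graph the category Energy is linked to every company, so its degree centrality is 1.0, which is the sense in which a category can be called mainstream.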

ESG-related actions' variability
We leverage information theory to assess the entropy of ESG topics based on their associated predicates yielded from companies' sustainability reports. Specifically, we adopt Shannon's entropy (Equation (1), [103]) to measure the information, and thus the variability, present in a set of events X through their respective occurrence probabilities p(x):

$$H(X) = -\sum_{x \in X} p(x) \ln p(x) \qquad (1)$$

In our context, the events X are the predicates disclosed by all companies for a given ESG topic, while p(x) represents their relative occurrences. High entropy denotes high variability in the predicate occurrences, indicating an ESG topic addressed through many actions (predicates) with an almost uniform probability of predicate occurrence. On the other hand, low entropy indicates the predominance of a limited set of predicates for an ESG topic.
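A short sketch of the entropy computation, using the natural logarithm so that the result is expressed in nats, is given below. The function name and the toy predicate lists are illustrative assumptions.

```python
import math
from collections import Counter

def shannon_entropy(events):
    """Shannon entropy (in nats) of the empirical distribution of `events`."""
    counts = Counter(events)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical predicates disclosed for two ESG topics.
varied = ["Reduction of", "Investment in", "Assessment of", "Partnership with"]
dominated = ["Reduction of"] * 9 + ["Investment in"]
h_varied = shannon_entropy(varied)
h_dominated = shannon_entropy(dominated)
```

Four equally frequent predicates give the maximum entropy for four outcomes, ln 4, whereas a topic dominated by one predicate yields a much lower value.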

Similarity analysis
We estimate company similarities based on jointly disclosed ESG-related actions through the Jaccard similarity coefficient [104], which measures the similarity between two sets as the cardinality of their intersection over the cardinality of their union:

$$J(c_i, c_j) = \frac{|A_{c_i} \cap A_{c_j}|}{|A_{c_i} \cup A_{c_j}|} \qquad (2)$$

where $A_{c_i}$ is the set of ESG-related actions disclosed by company $c_i$. To mitigate the influence of stochastic fluctuations on the similarity score, we generated a null model [105] with a bootstrapping technique by computing company similarities on the randomised action sets through 1000 simulations, and subtracted this null model from the observed company similarities.
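The similarity score and its null-model correction can be sketched as below. The randomisation scheme is an assumption: the text does not specify how action sets are randomised, so here each company draws a random set of the same size from the pool of all actions. The function names and the toy data are hypothetical.

```python
import random

def jaccard(a, b):
    """Cardinality of the intersection over cardinality of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def null_corrected_similarity(a, b, universe, n_sim=1000, seed=0):
    """Observed Jaccard similarity minus the mean similarity of random sets.

    Assumption: each randomised set keeps the original set size and is
    sampled uniformly, without replacement, from `universe`.
    """
    rng = random.Random(seed)
    pool = sorted(universe)
    null_total = 0.0
    for _ in range(n_sim):
        random_a = set(rng.sample(pool, len(a)))
        random_b = set(rng.sample(pool, len(b)))
        null_total += jaccard(random_a, random_b)
    return jaccard(a, b) - null_total / n_sim

# Hypothetical action identifiers.
universe = {f"action_{i}" for i in range(10)}
a = {"action_0", "action_1", "action_2"}
b = {"action_1", "action_2", "action_3"}
observed = jaccard(a, b)
corrected = null_corrected_similarity(a, b, universe)
```

Subtracting the null mean removes the overlap that two companies would share by chance alone given how many actions each discloses.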

Correlation analysis
We evaluate whether company similarities in terms of jointly disclosed ESG-related actions (Sect. 3.5.3) are correlated with similarities in ESG scores or other company characteristics such as market capitalization or geographical location. We first measure feature similarities through different strategies, ensuring the same numerical range and monotonicity. The similarities in numerical features, such as ESG scores, are measured by computing the absolute numerical difference normalised using min-max scaling [106]. Textual features, such as company sectors, are first embedded using the "INSTRUCTOR-xl" embedding model [90], and their semantic similarities are then assessed through the cosine similarity normalised in [0, 1] using min-max scaling [106].
A complete list of all the features and measures used can be found in the Supplementary Material (see Sect. SM2.4). Afterwards, we perform a bivariate correlation analysis to measure the monotonic association between action similarities and similarities of other company features. We rely on Kendall's τ correlation coefficient (Equation (3), [107]), a nonparametric and rank-based statistic computed as:

$$\tau = \frac{n_c - n_d}{n_c + n_d} \qquad (3)$$

where $n_c$ and $n_d$ are the numbers of concordant and discordant pairs respectively. Rank-based correlation methods overcome some limitations of traditional correlation methods such as the well-known Pearson correlation coefficient [108]: they can measure nonlinear monotonic relationships, are more robust against outliers, and do not require the normality assumption [109]. High positive coefficients express a high level of order consistency in the company similarities sorted according to actions' and other similarities, while high negative coefficients occur when these two similarities are sorted in reverse order [107].
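The counting of concordant and discordant pairs behind Kendall's τ can be sketched as follows. This illustration assumes the simple form τ = (n_c − n_d) / (n_c + n_d), in which tied pairs are ignored; the variant used by the authors may treat ties differently. The function name and the toy similarity vectors are hypothetical.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau from concordant and discordant pairs (ties are skipped)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0

# Hypothetical similarities of one company to three others.
action_similarity = [0.07, 0.06, 0.02]
region_similarity = [1.0, 0.4, 0.6]
tau = kendall_tau(action_similarity, region_similarity)
```

With three companies there are three pairs; two are ordered consistently and one is reversed, so τ is 1/3 in this toy case.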

Interpretability of ESG scores
Lastly, we investigate the interpretability of ESG scores through linear regression and the SHAP (SHapley Additive exPlanations) framework [13]. Here, we investigate the factors that most affect the ESG scores of companies by exploiting the interpretability of a first-order linear regression model. The model predictors are based on our findings and other company information (Sect. 3.1.3). We first use as predictors the percentage of the top ten most disclosed ESG categories for each company. For example, if a company has ten percent of its generated triples concerning the ESG category Waste, and that category is within its top ten most disclosed topics, the feature Category:Waste for this observation has a value of 0.1. We also consider the proportion of the E/S/G pillars based on all the disclosed categories for each company. In addition, we compute the category and action entropy for each company, indirectly representing the cardinality of the ESG categories and actions disclosed in its sustainability report (Sect. 3.5.2). Lastly, we consider nine company-related features as further predictors, encompassing five company characteristics and four annual financial attributes. Specifically, we include in our predictors the company sector, the country, region and continent of its headquarters, and its incorporation year. We also include financial features concerning the fiscal year of the analysed sustainability report: EBITDA (Earnings Before Interest, Taxes, Depreciation and Amortisation), liabilities, market capitalization and the number of employees.
The descriptive statistics of these report-based and company-based predictors are exhibited in Fig. 5 and Table 4. Categorical variables, such as the company sector, are turned into binary indicator variables (i.e., dummy variables [110]), generating a total of 97 features, whereas numerical variables, such as market capitalization, are standardised. An example of an observation, encompassing a complete list of features, is exhibited in Sect. SM2.5.1 of the Supplementary Material (SM) document. Then, we adopt an Ordinary Least Squares (OLS, [111]) regression with Elastic Net Regularization [112] to perform inference on the available ESG scores of companies. We adopt Elastic Net Regularization, a generalisation of the LASSO method [113], to perform both feature selection and training regularisation. We choose this regularisation method as it improves performance when the number of predictors (|features| = 97) exceeds the number of observations (|companies| = 89) as well as in the presence of strong pairwise correlations [112]. We train the OLS model using the Elastic Net cost function and an eight-fold cross-validation approach [114] (see Sect. SM2.5.3). The performance of the OLS model is discussed in Appendix E.
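To show how the Elastic Net cost combines the squared error with the L1 and L2 penalties, the sketch below evaluates one common parametrisation of the objective, the one documented for scikit-learn's ElasticNet estimator. Whether the authors used this exact weighting is an assumption, the intercept is omitted for brevity, and the function name and toy data are hypothetical.

```python
def elastic_net_objective(X, y, w, alpha=1.0, l1_ratio=0.5):
    """Elastic Net cost for a linear model without intercept.

    cost = ||y - Xw||^2 / (2n)
           + alpha * l1_ratio * ||w||_1
           + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2
    """
    n = len(y)
    residuals = [
        yi - sum(wj * xj for wj, xj in zip(w, row))
        for row, yi in zip(X, y)
    ]
    squared_error = sum(r * r for r in residuals) / (2 * n)
    l1 = sum(abs(wj) for wj in w)
    l2 = sum(wj * wj for wj in w)
    return squared_error + alpha * l1_ratio * l1 + 0.5 * alpha * (1 - l1_ratio) * l2

# Toy design matrix and targets (hypothetical data).
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
cost_zero_weights = elastic_net_objective(X, y, [0.0, 0.0])
cost_perfect_fit = elastic_net_objective(X, y, [1.0, 2.0])
```

In this toy case the perfect fit costs more than the all-zero weights once the penalties are added, which illustrates how the L1 term pushes coefficients towards zero and thereby performs feature selection.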
Subsequently, we employ the SHAP framework [13] to investigate which predictors impact the inference of ESG scores most. SHAP is a model-agnostic and additive feature importance measure which is based on cooperative game theory. It provides local interpretations of model predictions as additive sums of the directed effects of each predictor using a conditional expectation function. SHAP starts from the prior knowledge of the expected model output E[f(X)], and then evaluates, for each model prediction, the magnitude and direction changes in this expected value (SHAP values) when conditioned on each predictor. It thus quantifies the magnitude and direction of the observed effects of each predictor. Predictors with positive SHAP values affect the expected model output with additive increments; conversely, those with negative values have additive decrement impacts.
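For a linear model, SHAP values have a simple closed form when the predictors are treated as independent: the contribution of predictor j is its coefficient times the deviation of its value from the background mean. The sketch below illustrates this additive decomposition; the function names, coefficients and data are hypothetical, and the independence assumption is a simplification that may differ from the configuration used by the authors.

```python
def linear_shap(weights, x, background):
    """SHAP values of a linear model under feature independence.

    phi_j = w_j * (x_j - mean_j), where mean_j is the background mean.
    """
    n = len(background)
    means = [sum(row[j] for row in background) / n for j in range(len(weights))]
    return [w * (xj - m) for w, xj, m in zip(weights, x, means)]

def predict(weights, intercept, x):
    return intercept + sum(w * xj for w, xj in zip(weights, x))

# Hypothetical model with two standardised predictors.
weights, intercept = [2.0, -1.0], 70.0
background = [[0.0, 0.0], [2.0, 4.0]]
x = [2.0, 0.0]

phi = linear_shap(weights, x, background)
expected_output = sum(predict(weights, intercept, row) for row in background) / len(background)
```

The SHAP values add up to the gap between the prediction for `x` and the expected model output, which is the additivity property described above: positive entries of `phi` raise the prediction and negative ones lower it.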

Results
In this section, we first report the network statistics computed at the node level from the three bipartite graphs in Sect. 4.1. Next, a diversity analysis examines how ESG topics are disclosed across companies and different sectors (Sect. 4.2). Section 4.3 reports company similarities according to jointly disclosed ESG actions. The follow-up section (4.4) addresses whether these company similarities are associated with similarities in other company information. Finally, we evaluate the interpretability of ESG scores by investigating the most impacting factual aspects (Sect. 4.5). A qualitative analysis of the generated triples and an ablation study on the model instruction are exhibited in Appendix A and Appendix B.

Bipartite graphs' analysis
We here present some network statistics concerning the three bipartite graphs $B_{cocat}$, $B_{catpred}$, and $B_{coact}$. Further statistics and extensive tables for all three bipartite graphs are shown in the Supplementary Material document in Sect. SM3.2.

The ESG movement encompasses many socially responsible issues
Our data-driven methodology unveils that the one hundred and twenty-four companies disclosed 542 distinct ESG topics/categories in their sustainability reports. The company-category bipartite graph $B_{cocat}$ has an average degree centrality of almost 11%, making this graph relatively connected. There are however some mainstream ESG categories: Climate Risk Management, Supply Chain, Energy and Corporate Governance are connected to, and thus disclosed by, almost all the considered companies (degree > 92%, Table 5). Conversely, the list of the least disclosed categories is rather surprising: Market Responsibility, Anti-Discrimination and LGBTQ+ Inclusion are connected to, and thus disclosed by, less than 5% of all the considered companies (Table 5).

ESG topics are addressed from several perspectives, with some frequent ones
The average degree centrality of the category-predicate bipartite graph $B_{catpred}$ is less than 1%. There are however some predominant predicate nodes (see Sect. SM3.2.3) that are associated with more than ninety ESG categories (degree ≥ 16.6%) such as Commitment and involvement with (113 categories), Advisor support for (102), Partnership with (97), and Establishment of (94). Interestingly, these prominent nodes exhibit high closeness centrality (> 88%) in contrast to their relatively low degree centrality values. This indicates that common category nodes indirectly connect them.

Common actions are the exception
The company-action bipartite graph $B_{coact}$ connects the company nodes to almost twenty thousand different ESG-related actions (19,574) disclosed in companies' sustainability reports. However, there are a few prominent actions disclosed by the majority of the considered companies: AIR EMISSION:Reduction of (degree of 70%, that is, disclosed by 70% of the companies), ENERGY:Reduction of (61%), PHILANTHROPY:Donation by (60%) and CLIMATE RISK MANAGEMENT:Assessment of (56%).

Diversity analysis on disclosing ESG categories
In this section, we analyse the differences in disclosing and addressing ESG topics in companies' sustainability reports.

Companies approach ESG topics from plenty of perspectives, especially those generic and vague
The predicates associated with each ESG category vary significantly, with an average Shannon's entropy of 1.5 nats. This broad action variety is predominant in generic and umbrella ESG topics such as Corporate Governance, Human Rights, and Supply Chain (Table 6). For example, when addressing Product Safety, companies approach it from different perspectives, ranging from developments (2.3%) to regulatory compliance (2.3%) and assessments (4.4%, Table 6). Moreover, a notable correlation (corr = 0.84) exists between the entropy of categories and the number of companies disclosing information about a particular ESG category. This suggests differences in how each company approaches these ESG topics.

Cross-sector vs sector-focused topics
Almost 12% of the 542 ESG categories are disclosed across all company sectors, encompassing various umbrella aspects such as Climate Risk Management, Supply Chain and Business Ethics. However, it is noteworthy that certain sectors emphasise specific topics more than others (Table 7). For example, Packaging is more emphasised in Consumer Staples companies, such as PepsiCo (18% of all the company triples), Coca-Cola (9%), Monster Beverage (6%), and Tesco (5%). This category accounts for 4.9% of all generated company triples within this sector, as exhibited in Table 7. Another example is Water which is more stressed

Comparison with ESG materiality at the sector and industry level
We compare the data-driven evidence of sector-focused ESG topics with the sector-level ESG materiality matrix (Sect. 2.3) identified by Khan et al. [35]. We explore the relevant issues identified for the Financials sector for explanatory purposes, as it is one of the sectors common to the authors' work and ours. There are thirteen financial companies in our sample: nine commercial banks (e.g., Deutsche Bank), two credit service companies (e.g., Mastercard), one insurance company (Assicurazioni Generali) and one asset management company (3i Group). Our data-driven findings unveil that the companies' sustainability reports address the ten issues identified as relevant for this sector with different importance. Financial companies, for example, extensively disclosed actions concerning: Environmental and social impacts on core assets and operations (Climate Risk Management: 7% of all the sector triples), Diversity and Inclusion (Financial Inclusion: 6.1% and Board Diversity: 5.2%) and Lifecycle impacts of products and services (Product Sustainability: 4.1%). On the other hand, companies' non-financial disclosures neglect to address some relevant issues encompassing: Access and affordability (Access to Basic Services: 0.8%, Accessibility: 0.3%, and Access to Information: 0.1%), Fair marketing and advertising (Marketing and Advertising: 0.1%), and Business ethics and transparency of payments (Business Ethics: 0.2% and Anti-Money Laundering: 0.1%). A table exhibiting the comparison with all ten issues can be found in Sect. SM3.6 in the Supplementary Material (SM) document. Furthermore, we select UniCredit, a financial company operating in the industry of Commercial Banks, as an explanatory example to compare our findings with the relevant disclosure topics outlined in SASB standards for its industry. This reporting standard organization identifies six important disclosure topics for commercial banks [115], which were addressed differently in the company's sustainability report. The company focused on industry-relevant issues encompassing: Financial Inclusion & Capacity Building (Financial Inclusion: 4.6%) and Data Security (Data Privacy: 2.4%). It however neglected to disclose much information concerning issues such as Financed Emissions (Product Design: 0.4%) and Incorporation of Environmental, Social, and Governance Factors in Credit Analysis (Environmental Risk Assessment: 0.4%).

Different sectors employ tailored actions to address ESG topics
Although there are a few widely disclosed actions (Sect. 4.1), the same action is disclosed, on average, by less than 2% of the considered companies. Moreover, only 15% of the company sectors are, on average, engaged in the same action. This unveils different priorities and a variety of approaches among companies and sectors. For example, the Assessment of aspects concerning Climate Risk Management is more emphasised by Real Estate companies which manage assets vulnerable to climate risks, such as Park Hotels Resorts (2% of all the company triples) and Sun Communities (1%). Conversely, Materials companies such as United States Steel (1%), Yamana Gold (1%) and Aluminum Corporation of China (1%) emphasise instead the Commitment and involvement concerning Employee Safety, a worker-related topic. Further extensive tables are in the Supplementary Material document (see Sect. SM3.1).

Company similarities based on disclosed ESG-related actions
Here, we discuss company similarities according to jointly disclosed actions using the Jaccard similarity coefficient (see Sect. 3.5).
Companies from the same sectors tend to perform similar actions
For example, as depicted in Fig. 6 and outlined in Table 8, five companies among the top ten similar companies of Deutsche Bank are banks too, including Royal Bank of Canada (action similarity equal to 7%), Banco Santander (6%) and UniCredit (6%). Notably, Visa and Mastercard, both operating in the Credit Services industry, form a distinct group (Fig. 6). Comparably, action similarities emerge in companies operating in the Health Care sector: Moderna, Vertex Pharmaceuticals and AstraZeneca (Table 8). Further details are visible in Sect. SM3.4 of the Supplementary Material document.

Figure 6 A network diagram linking companies that report similar actions, determined by the Jaccard similarity coefficient. It exhibits only connections between companies with a similarity equal to or greater than 6%. Node colour corresponds to distinct sectors, and node size is proportional to their connectivity. Some connections are noteworthy for linking companies within the same sectors or geographical regions

Companies from the same geographical region tend to perform similar actions
For example, as can be visually noted in Fig. 6, 80% of the ten most similar companies of Sony (Japan, Eastern Asia) are companies from the same geographical region: 40% from Japan and 40% from South Korea. Similarly, Geely Automobile (China, Eastern Asia) has 70% of its ten most similar companies from the same region too: 40% from China and 30% from South Korea. On the western side, there are six European companies in the ten most similar companies of Enel (Italy, Southern Europe), with Italian companies representing 40% of the total.

Correlation analysis among company similarities
This section answers the research question concerning whether company similarities in terms of jointly disclosed actions (Sect. 4.3) are associated with similarities in other company information (Sect. 3.1.3). We perform a bivariate correlation analysis using Kendall's correlation coefficient (Sect. 3.5) for each company with all information available: 81% of the considered companies. Aggregated results are exhibited in Fig. 7 through box plots.

Similarities in companies' disclosed actions are correlated with companies' geographical regions
Action similarities have the highest, yet weak, correlation with the Region and Country of the company headquarters, with median correlation coefficients of 0.18 and 0.15, respectively. This confirms the empirical findings discussed in the previous section (4.3). Moreover, only these two features demonstrate median p-values, resulting from the null hypothesis test of zero monotonic correlation, below the established significance threshold of 5%, namely 1% and 2%, respectively. The p-value distributions for all the features are shown in Sect. SM3.7.1 of the Supplementary Material (SM) document. Taking the previous example companies, Sony and Enel exhibit a relatively high monotonic correlation between their company similarities in terms of disclosed actions and geographical locations. Sony has an action-country similarity correlation equal to 0.22 and an action-region correlation equal to 0.20, while Enel exhibits a lower action-country correlation (0.14) and a higher action-region correlation (0.25).

Figure 7 Distributions of pairwise correlations between companies' action similarities and similarities in other company features (rows). Features are colour-grouped according to their type of information. Light-green features represent categorical company characteristics, while azure and dark blue features represent numerical features concerning companies' financial and ESG information

No other statistically significant similarity correlations emerge
Company similarities in ESG scores and Industries have a median pairwise correlation with companies' disclosed actions equal to 0.1 and 0.09, respectively. Their statistical significance appears however relatively weak due to high median p-values (13% and 15%), which do not allow us to reject the null hypothesis of zero monotonic correlation.

ESG scores are only correlated with their underlying components
After analysing company similarities from the disclosed action perspective, we perform a pairwise monotonic correlation analysis to unveil possible confounding factors for company similarities. Strong monotonic correlations appear between similarities in companies' Region and Country (median correlation equal to 0.7) as well as between companies' ESG score and their Social (0.5) and Environmental Pillar scores (0.4). No other relevant correlations emerge for ESG scores or other company information. A graphical representation of all the pairwise correlations is exhibited in Sect. SM3.7.2 of the Supplementary Material (SM) document.

Interpretability of ESG scores
Lastly, we investigate the interpretability of companies' ESG scores by employing a first-order linear regression and the SHAP framework (Sect. 3.5.5). We specifically evaluate how various factual and corporate aspects impact these scores using features based on the companies' extracted actions (such as the most disclosed ESG topics), and additional financial and company-specific information.

Social-related actions and company transparency have a significant impact on ESG scores
On average, the most impacting aspects affecting ESG scores are the percentages of disclosed actions related to Human Rights and Employee Development, with mean SHAP values of 2.6 and 2.7, respectively. High percentages of the former (colour scale in Fig. 8) positively impact ESG scores, while high percentages of the latter hurt them. High disclosing percentages in actions related to Philanthropy (mean SHAP value of 1.1) and Energy (0.7) also hurt companies' scores. With a similar average magnitude but the opposite effect, disclosing several actions related to Waste (0.8) or Supply Chain (0.7) has a positive impact. Furthermore, high variety in the disclosed ESG topics (Category Entropy) positively affects ESG scores with a mean SHAP value of 2.1 (Fig. 8). With a similar average magnitude (1.9), being founded earlier, that is, having an earlier Incorporation Year, positively impacts a company's score. This is also validated numerically by a negative Kendall correlation equal to -0.22 (p-value of 0.5%) and visually by Fig. 9, which groups the companies' ESG scores by their decade of incorporation. The median ESG score of the fifty-three companies founded in the 20th century is 77.2, whereas the thirty-one companies established in the current century exhibit a lower median score of 68.6. Notably, the three companies founded in the 19th century exhibit the highest median ESG score, equal to 79.6. Further noteworthy factors positively impacting ESG scores are being a European company (CONTINENT:Europe, mean SHAP value of 0.7) and exhibiting a high level of Liabilities (0.5). In contrast, high annual earnings (EBITDA, 0.6) have a slight negative impact on ESG scores. The positive influence of being European can also be validated by grouping ESG scores by company region: the twenty-six companies from Europe exhibit the highest average ESG score, equal to 82, the forty-nine American companies have an average ESG score of 69.7, whereas the average ESG score of the twelve Asian companies is equal to 67 (see Sect. SM1.5 in the Supplementary Material document).

Local interpretability analysis reveals company-specific impacting factors
Moving from global to local interpretability, we choose Sony as an example company and investigate the most impacting factors for its ESG score (Fig. 10). The Incorporation Year and Category Entropy features, respectively far below (standardised value of -1.03, representing the year 1946) and above (0.87, 3.6 nats) the average of company values, positively affect its score. In addition, disclosing several actions related to Human Rights (0.57, 4.4% of all its extracted actions) and Waste (1.31, 3.4%) has a positive impact. In contrast, disclosing fewer Energy-related actions than the average (-0.98, not among its

Region-based interpretability analysis reveals common patterns
We also conduct a more granular analysis by exploring the ESG score interpretability of a company cluster. Coherently with the example company, we select Asian companies encompassing five Chinese companies (e.g., Alibaba), five Japanese companies (e.g., Sony), Aramco (Saudi Arabia) and Geely Automobile (Hong Kong). The Incorporation Year strongly affects their ESG scores, with an average SHAP value of 2.7, in line with the global interpretation. This is further validated by a strong negative Kendall correlation of -0.61 (p-value of 0.7%). All companies established in the 20th century, such as Toshiba (1904, score of 93.6), Toyota (1937, score of 84.5) and Geely Automobile (1946, score of 75.4), exhibit ESG scores above 66. In contrast, those established in the current century all have lower scores, such as Baidu (2000, score of 53.5), China Evergrande (2006, score of 52.8) and Aramco (2018, score of 42.9). From the geographical point of view, being a Chinese company (COUNTRY:China) hurts ESG scores with an average SHAP value of 0.6. This is further emphasised by the observation that all the Chinese companies consistently have ESG scores below 62.5, whereas both Hong Kong-based and Japanese companies consistently display higher scores. Furthermore, this analysis confirms that disclosing several actions related to Human Rights (e.g., Toshiba and Tokyo Gas) and Waste (e.g., Toyota and Sony) positively impacts ESG scores, whereas their absence negatively affects them (e.g., Baidu and Daikin Industries). The ESG scores grouped by region, a detailed list of the considered Asian companies and the bee-swarm graph of the latter analysis are reported in the Supplementary Material (see Sects. SM1.5, SM3.8 and SM3.9).

Discussion
Now, we address the practical implications of our findings (Sect. 5.1) as well as the methodological implications of our proposed approach (Sect. 5.2). Lastly, we discuss some potential limitations of our work in Sect. 5.3.

Practical implications
High action variety in addressing ESG topics
As highlighted in Sects. 4.1 and 4.2, companies address ESG topics from many perspectives, ranging from recognition and commitments to developments, partnerships and compliance. This foregrounds the complexity and joint efforts needed to address ESG-related aspects and the involved external subjects such as ESG rating agencies. Our analysis unveils that the same action is disclosed, on average, by only 2% of the companies, and by only 15% of all the company sectors, confirming the lack of a common approach across companies and sectors. However, some ESG topics are addressed through a common strategy by the majority of the considered companies: the actions Air Emission:Reduction of and Energy:Reduction of are disclosed by 70% and 61% of the companies, respectively (Sect. 4.1).

The ESG phenomenon has blurred boundaries and includes plenty of socially responsible topics
Concerning the ESG topics disclosed by companies, our methodology extracts 542 distinct ESG topics from companies' sustainability reports, a set roughly eight times larger than the topics originally included in the ESG categorization used as a semantic reference in this work (sixty-four categories, Sect. 3.1.2). Firstly, this unveils the broad scope of the ESG phenomenon, involving socially responsible topics ranging from Waste Management and Supply Chain to Employee Safety and Tax Compliance. Secondly, this highlights the presence of widely disclosed topics, such as Supply Chain, and sector-focused topics, such as Packaging and Water for the Consumer Staples sector. The diversity analysis reported in Sect. 4.2 confirms this sector-based importance for certain topics. For instance, the ESG topic Water is stressed more by water-intensive companies, while Packaging is emphasised more by companies producing consumer staples (Table 7). This data-driven insight is also validated by the ongoing discussions in the ESG literature concerning ESG Industry Materiality (Sect. 2.3). Furthermore, by comparing with sector- and industry-level ESG materiality matrices (Sect. 4.2), we can assess companies' disclosures against the expected sustainability issues in their reports, highlighting variations in topic coverage.

Exogenous factors might influence companies' non-financial disclosures
The findings reported in Sect. 4.3 emphasise company similarities based on their sectors, indirectly confirming a relatively high presence of common strategies among companies from the same sector. However, the factor with the greatest impact in grouping companies based on their disclosures is their geographical region, as shown in Sects. 4.3 and 4.4. This is an interesting finding from our data-driven work, validating the ongoing discussions in the literature concerning the influence of a company's geographical origins on its non-financial disclosures (Sect. 2.3). For example, other studies have analogously unveiled the impact of exogenous factors on these disclosures, ranging from region-specific regulations [62] to the political, labour, and cultural environment of the company's nation [32,33,63].

Companies' social and environmental performance holds greater importance than the governance performance in the combined ESG scores
The bivariate correlation analysis reported in Sect. 4.4 shows that similarities in ESG scores are associated neither with similarities in disclosed actions nor with other financial or company characteristics, a noteworthy finding of our work. It does, however, unveil strong monotonic correlations between similarities in the companies' region and country, as well as between the ESG score and the social and environmental pillar scores (Sect. 4.4). These two associations appear fairly trivial: first, the ESG score is a weighted score combining the scores of the three E/S/G pillars (Sect. 3.1.3); second, the region and country have a natural geographical relation. However, the monotonic associations of ESG scores could be exploited to roughly infer the average influence, and thus the importance, of the E/S/G pillar scores towards the combined score. For instance, a weak or zero monotonic correlation suggests that (dis)similarities among scores of one specific pillar are not associated with (dis)similarities in the combined score. This could imply that a particular pillar holds relatively less importance, or weight, in determining the combined ESG scores. Conversely, when a pillar is significant, its (dis)similarities reflect the (dis)similarities of the combined ESG scores. Hence, based on the monotonic associations of ESG scores, it could be inferred that, on average, the social pillar (0.5) holds slightly greater importance than the environmental pillar (0.4). In contrast, corporate governance bears minimal importance in ESG scores (0.2). Although the company sample and the fiscal years considered may influence the inference of the rating agency's methodology, insights on E/S/G weightings can help validate the factors impacting ESG scores unveiled in Sect. 4.5.

Exogenous factors can influence the quality of companies' non-financial disclosures, indirectly impacting their ESG performance assessments
The interpretability analysis of ESG scores (Sect. 4.5) highlights that the company's disclosures impact ESG scores more than other financial aspects or company characteristics. Disclosing several ESG topics (category entropy) positively affects ESG scores, whereas disclosing fewer ESG topics hurts scores. This data-driven insight accordingly confirms that transparency on non-financial information rewards companies' ESG assessment [63,66]. The analysis of Sect. 4.5 also confirms the negligible impact of governance-related topics on ESG scores in comparison to social- and environmental-related topics such as Human Rights, Energy and Waste. The findings of Sect. 4.5 also unveil that disclosing many actions related to the ESG topics Employee Development and Energy negatively impacts ESG scores. One hypothesis is that their high presence in a company's sustainability report might reveal a poor coverage of other important ESG topics [76]. For example, in the local interpretation of Sony's ESG score, a low disclosing percentage of Energy-related actions and a high disclosing percentage of Waste-related actions positively impact its score. Another noteworthy finding of this analysis is the region's impact on ESG scores: being a European company positively impacts companies' ESG scores, whereas being a Chinese company hurts them (Sect. 4.5). The Chinese penalization factor might be reinforced by the fact that all the considered Chinese companies were founded between 1999 and 2006, likely due to the remarkable economic development of this region starting in the 2000s, and is thus associated with the negative impact of being relatively young companies (Sect. 4.5). Indeed, our interpretability analysis also uncovers a beneficial effect associated with earlier incorporation years. However, additional investigation is necessary to exclude spurious correlations. The impact of the company's region on ESG scores is nonetheless coherent with the region-based disclosing similarity previously highlighted and validated by ongoing discussions in the ESG literature concerning a European bias (Sect. 2.3). Further validation of the positive impact of being a European company can be found by delving into the environmental performance of European companies in our sample: they demonstrate the highest average environmental score of 81.8, surpassing Asian companies with an average score of 69 and American companies with the lowest average score of 65.6 (see Sect. SM1.5 in the Supplementary Material document).
On the other hand, other studies suggest that companies' ESG assessments might indirectly suffer from biases based on company size (Sect. 2.3). The interpretability analysis in Sect. 4.5 unveils negligible evidence of this company size bias: a greater number of full-time employees has a positive, yet marginal, impact on ESG scores (average SHAP value of 0.24), whereas the company's market capitalization has no impact on interpreting companies' ESG scores. In addition, the Kendall pairwise correlations of company similarities (Sect. 4.4) concerning these two company variables and similarities in ESG scores are not statistically significant: a monotonic correlation of 0.1 for the number of employees (p-value of 71%), and a zero correlation for the market capitalization (p-value of 78%).

Methodological implications
As mentioned in Sect. 3.3, generative LLMs provide us with the semantic understanding and flexibility needed to overcome the limitation of traditional OIE approaches, which rely only on the syntactical sentence structure. However, employing a 7-billion-parameter LLM [16] for information extraction (see Retrieval-Augmented Triple Generator in Fig. 1) leads to a higher computational load. Nonetheless, this allows us to generate semantically aware and ESG-focused triples instead of traditional SPO ones. This is pivotal in generating all the meaningful ESG-related insights of our work.

Utilising generative LLMs and the ESG categorization for semantic guidance enhances retrieving more comprehensive data-driven insights
The flexibility and generative abilities of these language models also allow us to highlight and overcome some limitations in the data sources, such as those of the ESG categorization used in our work (Sect. 3.1.2). This categorization extrapolates a concise set of ESG topics by categorising several ESG-related indicators shared among ESG rating providers. It accordingly derives the scope of the ESG phenomenon from the perspective of rating agencies, a different viewpoint compared to the companies' disclosures analysed in our work. However, these different points of view, and the methodology based on generative LLMs, help us to unveil differences between the ESG topics considered by rating agencies and those disclosed by companies. We indeed extract a set of ESG topics/categories disclosed in companies' sustainability reports that is eight times larger than the original list of categories of this classification (542 versus 64, Sect. 5.1). For instance, our methodology unveils "Education" as a pivotal ESG topic disclosed by more than two-thirds of the selected companies. This topic is not explicitly included in the categorization, although it could be framed within three of its categories: Access to Basic Services, Human rights (Art. 26) or Philanthropy. Additional examples include the extracted ESG category "Circular Economy", which might fall under the categorization categories of "Waste" or "Resource Efficiency", as well as the ESG category "Air Quality", which could be framed within "Green Buildings" or "Health and Safety".
Accordingly, the aforementioned ESG categorization either hides critical topics within vague categories or potentially overlooks them altogether, which can substantially reduce the significance of subsequent analyses. Our approach addresses this limitation by leveraging a generative LLM in conjunction with the ICL technique and the RAG paradigm. This allows us to jointly simulate the outputs of a supervised text classifier, whose labels are the categories of the ESG categorization, and semantically generalize those labels. The ESG categorization is consequently leveraged as semantic guidance by the LLM, helping it to extract more suitable topics while keeping the domain and semantics of the original ESG categorization. This semantic generalization can also diminish the reliance on specific ESG categorizations when classifying sentences, since the ESG categories are exploited as semantic references rather than fixed labels. Nevertheless, this could also lead to the undesirable phenomenon of over-specialization, which was tackled using semantic clustering (Sect. 3.4). We also used this ESG categorization, in the data preparation phase, to filter the report sentences using the text embeddings (Sect. 3.2.1). This semantic-oriented filtering approach allows us to move further towards a taxonomy-agnostic methodology, as filtering is based on semantics rather than single keywords.
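The idea of embedding-based sentence filtering can be illustrated with a minimal sketch. It assumes that sentences and ESG categories have already been mapped to vectors by some text-embedding model and that cosine similarity with a cut-off is used for selection; the toy three-dimensional vectors, the example sentences, the function names and the threshold of 0.5 are all hypothetical and do not reflect the actual configuration of Sect. 3.2.1.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def filter_sentences(sentence_vecs, category_vec, threshold):
    """Return (score, sentence) pairs at or above the threshold, best first."""
    scored = [
        (cosine_similarity(vec, category_vec), text)
        for text, vec in sentence_vecs.items()
    ]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


# Toy vectors standing in for real text embeddings (hypothetical values).
category_vec = [1.0, 0.0, 0.0]  # e.g. the embedding of a "Waste" category
sentence_vecs = {
    "We reduced landfill waste across our plants.": [0.9, 0.1, 0.0],
    "Our board met six times this year.": [0.0, 1.0, 0.0],
    "Recycling rates improved in all regions.": [0.6, 0.0, 0.8],
}

kept = filter_sentences(sentence_vecs, category_vec, threshold=0.5)
for score, text in kept:
    print(f"{score:.2f}  {text}")
```

Because the selection depends on vector proximity rather than on the presence of a keyword, a sentence about recycling can be retained for a waste-related category even though it never uses the word "waste".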

Extracting insights from companies' sustainability reports using generative LLMs and graph representations
Lastly, our methodology differs from other recent ESG-focused and LLM-based tools (Sect. 2), such as ChatClimate [58] and ChatReport [60], by employing the paradigm of Retrieval-Augmented Generation (RAG), alongside In-Context Learning, for Knowledge Graph generation. This methodology, in combination with bipartite graph representation, allows us to report meaningful insights concerning the actions disclosed in companies' sustainability reports. In comparison, ChatClimate adopts the RAG paradigm to augment ESG-related questions for question-answering, whereas ChatReport leverages this paradigm to operationalise the compliance assessment of sustainability reports against the recommendation guidelines of the Task Force on Climate-related Financial Disclosures (TCFD).

Limitations
Because of the significant computational workload, we present insights for a sample of companies encompassing several sectors, regions and sizes. However, a larger sample (for example, 1000 companies' reports) would further validate our findings and enable a more substantive analysis.
Our data preparation NLP pipeline relies on a PDF parser [78] to extract texts from sustainability reports. This parser extracts all texts, including those from infographics and tables. This might yield some sentences without a proper syntactic structure, making extracting semantic meaning from them difficult or even impossible. Table 3 in Sect. 3.2.2 exhibits an example of this issue in the first sentence retrieved for the ESG category Waste. Although this sentence contains some details regarding this topic, it lacks a coherent message. However, the semantic understanding of LLMs, in combination with the In-Context Learning technique and the paradigm of RAG, could implicitly address this issue. Indeed, the sentence coverage of our retrieval-augmented triple generation (Sect. 3.3) is equal to 68.1%, meaning that the language model acts as an implicit filtering layer and avoids generating triples for about 30% of all the processed input sentences. The aforementioned example is within this set of ignored sentences. Although an end-to-end approach might be desired, discarding such meaningless sentences beforehand could help avoid an unnecessary computational workload. For instance, future works could tackle this issue by enhancing document parsing (e.g., by preserving the original layout) or adding a further, yet lightweight, filtering component. The latter might filter sentences according to their syntactical correctness or meaningfulness.
Another potential limitation concerns the interpretability of ESG scores using SHAP values. The SHAP framework is used to roughly interpret the impact of predictors on individual predictions. Global interpretability is derived using simple aggregating statistics such as mean/median SHAP values. Nevertheless, this aggregating approach might result in a mixed global interpretation in the presence of high diversity in the observations, as is the case for a set of companies from countries worldwide covering eleven distinct sectors. The global impact of some predictors could still be accurate, yet others might be a mix of sector-dependent relationships or be caused by the diversity of cause-effect connections. Indeed, the region-based interpretability analysis (Sect. 4.5) unveils more impacting factors or relationships for a specific company cluster in comparison to the global interpretability (Sect. 5.1). Future works might therefore conduct a further subset-based interpretability analysis by adopting a bottom-up approach and letting company groups emerge by themselves.
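The risk of a mixed global interpretation can be shown with a small numerical sketch. The SHAP values below are entirely hypothetical (they are not taken from our model) and describe one feature that pushes predictions up in one group of companies and down in another; the group names are placeholders.

```python
from statistics import mean, median

# Hypothetical SHAP values of a single feature for two made-up company groups.
shap_by_group = {
    "group_a": [1.2, 0.8, 1.0],     # feature raises the predicted score
    "group_b": [-1.1, -0.9, -1.0],  # same feature lowers the predicted score
}

all_values = [v for values in shap_by_group.values() for v in values]

global_mean = mean(all_values)                         # signed effects cancel out
global_median = median(all_values)
global_mean_abs = mean([abs(v) for v in all_values])   # magnitude is preserved
group_means = {g: mean(vs) for g, vs in shap_by_group.items()}

print(f"global mean:      {global_mean:+.2f}")
print(f"global median:    {global_median:+.2f}")
print(f"global mean |.|:  {global_mean_abs:.2f}")
for group, value in group_means.items():
    print(f"{group} mean:     {value:+.2f}")
```

In this toy setting the signed global mean is close to zero although the feature has a sizeable effect in each group, which is the kind of masking that a subset-based analysis is meant to expose.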
Lastly, the data provider for the ESG scores used in our work is a limitation worth highlighting. We rely on the ESG scores from the Refinitiv platform (Sect. 3.1.3), but, as highlighted in the Introduction, rating agencies have their own assessment methodologies, which could result in divergences in companies' ESG scores. Consequently, the findings relying on ESG scores (Sects. 4.4 and 4.5) might vary when using ESG scores from other rating agencies, such as Sustainalytics, which adopts a risk-based assessment [116]. In addition, future works could integrate further ESG-related attributes from these rating agencies, such as companies' water withdrawal, hazardous waste, gender pay gap and employee turnover.

Conclusions
In this work, we proposed a data-driven methodology based on generative LLMs to systematically evaluate the context in which ESG topics are disclosed by companies in their sustainability reports. The objective of this work was to contribute to the emerging field of automatic information extraction from companies' sustainability reports by implementing the best NLP pipeline to extract structured insights from lengthy and visually rich PDF documents. This generative LLM-based approach allowed us to directly investigate the companies' perspective concerning the ESG phenomenon.
Large Language Models (LLMs) can be versatile tools to accomplish diverse NLP-related tasks, including the extraction of structured information from textual data. We further explored this promising research direction by adopting the Retrieval-Augmented Generation (RAG) paradigm, alongside the In-Context Learning (ICL) technique, to extract ESG-related information as semantically structured triples. We then adopted a graph representation (bipartite graphs) to extract non-trivial statistics and conduct meaningful analyses concerning companies' disclosed actions. We employed a pre-trained language model from the open-source community, which, to the best of our knowledge, distinguishes our work from other recent publications. Furthermore, our LLM-based methodology overcomes important limitations related to traditional OIE techniques and the ESG categorization, allowing us to generate both semantically aware and ESG-oriented triples instead of traditional subject-predicate-object (SPO) triples. This helped us to report meaningful findings, such as statistical, similarity and correlation analyses on the ESG topics and actions extrapolated from companies' sustainability reports, as well as conduct an interpretability analysis of ESG scores. Future works might integrate further data sources, such as ESG-related news, to analyse possible inconsistencies in companies' claims and actions. Another interesting research direction might be to integrate Semantic Role Labelling (SRL) to enhance the extracted structured information with semantic roles, such as the agent, manner, and purpose of an action, as well as other contextual information, such as time and location.

Appendix A: Qualitative analysis of the generated triples
We evaluated the generated triples by prompting the same LLM (WizardLM, [16]) to evaluate triple quality. We prompted the model to evaluate the coherence and alignment between the structured information (output) and the sentence (input), considering also the coherence of each triple attribute (cat, pred, and obj). The model was prompted to evaluate each attribute, also leveraging its ICL abilities, using numerical scores on a scale from 0 to 3. The full model instruction used for this evaluation is exhibited in Sect. 2.3.1 in the

Appendix C: Hyper-parameter choice of the large language model
Large Language Models have some hyper-parameters for tuning their textual outputs, and consequently, some choices should be made to address these further degrees of freedom. The temperature parameter controls the randomness of the model responses by influencing the model's confidence in its most likely output [118]. During the decoding phase, this parameter alters the model output by scaling the logits before applying the softmax function. A high temperature makes the model output more diverse and creative but also more unpredictable. Conversely, a lower temperature makes the model output more deterministic and focused. Setting a temperature equal to zero corresponds to greedy decoding [118]. Accordingly, we opt for greedy decoding to ensure deterministic outputs and to make the generation process adhere to instructions as much as possible [60].
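The effect of temperature can be sketched with the standard temperature-scaled softmax, $p_i = \exp(z_i / T) / \sum_j \exp(z_j / T)$, where $z_i$ are the logits and $T$ is the temperature. The snippet below is an illustrative sketch only: the logits are made-up values, the function names are hypothetical, and the code is not the decoding implementation of the model used in this work. Greedy decoding is written as a plain argmax, since the limit $T \to 0$ cannot be evaluated by dividing by zero.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy_choice for T = 0")
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_choice(logits):
    """Greedy decoding: index of the largest logit (the T -> 0 limit)."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [2.0, 1.0, 0.1]  # made-up next-token logits

sharp = softmax_with_temperature(logits, 0.5)  # low temperature
flat = softmax_with_temperature(logits, 2.0)   # high temperature

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print(greedy_choice(logits))
```

Lowering the temperature concentrates probability mass on the top-scoring token, while raising it flattens the distribution; in both cases the ranking of the tokens is unchanged.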
Another hyper-parameter that affects text generation is the beam number, which represents the number of candidate sequences (beams) retained at each step of the beam search algorithm [118]. Beam search is a search-based decoding algorithm that improves the output of LLMs by pruning low-probability continuations at generation time. At each step, the algorithm extends the $b_{dim}$ most probable partial sequences, and it finally outputs the sequence with the highest probability [119]. We found through extensive experiments that a beam number ($b_{dim}$) ranging from 4 to 6 strikes a good balance between semantic representation and computational workload. We accordingly adopt a beam number equal to 6.
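A minimal sketch of textbook beam search may help to show why keeping several candidates can change the output. The `TOY_MODEL` table of next-token probabilities is hypothetical and stands in for a real language model, the function name and signature are illustrative, and the sketch decodes a fixed number of steps without end-of-sequence handling or length normalisation.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.4, "y": 0.3, "z": 0.3},
    "b": {"x": 0.9, "y": 0.1},
}


def beam_search(model, start, steps, beam_width):
    """Keep the beam_width most probable partial sequences at every step."""
    beams = [(0.0, [start])]  # (cumulative log-probability, token sequence)
    for _ in range(steps):
        candidates = []
        for log_prob, seq in beams:
            for token, prob in model.get(seq[-1], {}).items():
                candidates.append((log_prob + math.log(prob), seq + [token]))
        if not candidates:  # no sequence can be extended any further
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return max(beams, key=lambda b: b[0])[1]


greedy = beam_search(TOY_MODEL, "<s>", steps=2, beam_width=1)
wider = beam_search(TOY_MODEL, "<s>", steps=2, beam_width=2)

print(greedy)  # ['<s>', 'a', 'x'] with probability 0.6 * 0.4 = 0.24
print(wider)   # ['<s>', 'b', 'x'] with probability 0.4 * 0.9 = 0.36
```

With a beam width of one the search commits to the locally best first token and ends on a less probable sequence, whereas a width of two keeps the alternative alive long enough to find the more probable one. The cost grows with the beam width, which is the trade-off against computational workload mentioned above.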
To conclude the review of the regression model performance, we conduct a residual analysis to check the assumptions required to properly frame the problem as a linear model. The assumption of normally distributed residuals ($E_i \sim \mathcal{N}(0, \sigma^2)$) is confirmed by the Anderson-Darling test [123], with a p-value equal to 6.6%, as well as by the QQ plot of residuals versus the Normal distribution, in which the points lie on a line. Concerning homoscedasticity, a condition in which the residual variance is constant across all the model predictions, there are no visible patterns in the scatter plot of residuals versus predicted ESG scores. The same condition is confirmed by the scatter plot of the predicted ESG scores versus the actual scores. However, a slight overestimation trend might be spotted for ESG scores below 50, showing a limit of our predictors for interpreting these low scores. A graphical panel with all the graphical residual analyses is shown in the Supplementary Material document (see Sect. SM2.5.4).

Figure 2
Figure 2 Two labelled examples included within our model instruction. The input sentences were created to cover different syntactical structures and do not represent actual information. The examples are included in the model instruction to leverage the In-Context Learning abilities of the LLM.

Figure 5
Figure 5 Descriptive statistics of the numerical features used as part of the predictors of the Ordinary Least Squares (OLS) model. The features labelled with the starting word "category" reflect the percentage disclosure of an ESG topic. To enhance readability, the figure exhibits only three category features for explanatory purposes. The statistics are presented before standardisation

Figure 8 Figure 9
Figure 8 Summary of the sixteen features with the greatest impact on the inference of the ESG score. The features are ordered according to their median SHAP value. The x-axis represents the degree of positive or negative impact on the model output. Each dot represents a company instance and colours represent the company values of the standardised feature

Figure 10
Figure 10 Example of explanations for individual predictors for the ESG score of a company. The category-based features are extracted from the 2022 sustainability report of the Japanese company Sony, and other company information from the same fiscal year is considered. In addition, the actual ESG score and the model error (residual) are shown

Table 1
Definitions of the terms used in this study

Table 2
Selected companies by sector. Each sector showcases the number of companies represented. The "Companies" column shows a glimpse of the representative companies within each sector, offering a snapshot of the prominent companies chosen

Table 3
Example of the top three sentences selected for two ESG categories. The approach is capable of retrieving sentences that pertain to the two topics. However, it may also pick some meaningless sentences, such as the first one for the Waste topic, which comes from an infographic or a tabular layout
, both in Italy and in Brazil, are covered by Collective Labour Agreements reached with trade union organizations and ...
0.72
In addition to the protections and rights provided by law and the national collective labour agreement for the sector ...

Table 4
Descriptive statistics of the categorical variables used as part of the predictors of the Ordinary Least Squares (OLS) model. All these variables are transformed into binary indicator variables for each observation

Table 5
Graph metrics of a sample of all the 542 category nodes of the bipartite graph $B_{cocat}$

Table 6
A sample of the ESG categories with their entropy values computed. The three most frequent category predicates are reported alongside the percentage of companies disclosing that category

Table 7
Sample of all the ESG categories disclosed by companies in their sustainability reports. This table exhibits the category coverage through different percentages concerning: (i) the company triples including a category, (ii) the companies reporting a category, (iii) the companies reporting a category aggregated by sector, and (iv) the company triples aggregated by sector

Table 8
A company sample with the top three most reported actions and the most similar companies for each. Company similarities are assessed by computing the Jaccard similarity on the companies' disclosed action set

Table 9
Evaluation of a sample of five generated triples and their original sentences. The triples were automatically evaluated by an LLM outputting a 0-3 numerical score for each triple attribute

Table 10
Three comparative examples of the triples generated with/without using In-Context Learning in the model instruction

Table 11
Three comparative examples of triples generated with/without the semantic output data schema in the model instruction. The complete semantic schema can be seen in the Supplementary Material document within the full model instruction (see Sect. SM2.1).