Misery loves company: happiness and communication in the city

The high population density in cities confers many advantages, including improved social interaction and information exchange. However, it is often argued that urban living comes at the expense of happiness. The goal of this research is to shed light on the relationship between urban communication and urban happiness. We analyze geo-located social media posts (tweets) within a major urban center (Milan) to produce a detailed spatial map of urban sentiments. We combine this data with high-resolution mobile communication intensity data among different urban areas. Our results reveal that happy (respectively unhappy) areas preferentially communicate with other areas of their type. This observation constitutes evidence of homophilous communities at the scale of an entire city (Milan), and has implications for interventions that aim to improve urban well-being.


Introduction
For the first time in history, the majority of humans now live in cities. A complete theory concerned with the growth and dynamics of cities is still a work-in-progress []. However, our phenomenological understanding of cities is growing significantly thanks to progress in our ability to sense the dynamics of human behavior [], and the urban environment and infrastructure [, ].
The high population density in cities is associated with both desirable urban indicators such as innovation, economic growth, and employment opportunities, and with undesirable outcomes such as crimes, diseases and pollution [, ]. In particular, cities exhibit consistent sub-linear and super-linear scaling of many of these indicators [].
These characteristics of cities are attributed to many factors [, ]. Among those, special attention is often given to the role of social interaction [, ]. Characteristics of human social interaction, such as the role of weak ties [], structural holes [] and the diversity in interaction [] are often seen as important facilitators of success in cities. Recently, it has been suggested that cities are special because the increased urban population density leads to super-linear scaling in social tie density, thus facilitating super-linear scaling of information spreading [].
On the other hand, the societal success of cities is also a function of urban well-being and happiness. Policy-makers are interested in understanding the drivers of happiness in cities in order to sustain or increase it []. Thus, it is important to understand the interplay between urban communication, on one hand, and urban happiness on the other. This would help us understand, for example, whether and how urban communication structures facilitate or inhibit the well-being of citizens. In this paper, we take a first step towards investigating this issue.
Traditional studies rely on self-reporting through surveys to measure happiness at the level of entire cities [-]. Despite the pervasiveness of surveys in quantifying or indexing happiness measures, they suffer from a number of limitations, such as the unreliability of subjective data [].
Recently, researchers showed that the abundance of personal data emitted through social media (e.g., short-message broadcast mediums like Twitter) can reliably quantify individual happiness [-]. When combined with geographical information, this technique can be used to characterize the geographical distribution of happiness across large areas, such as the continental United States [].
Motivated by recent results on the geography of happiness and the communication structure in cities, we are interested in understanding the relationship between urban happiness and urban communication. We leveraged datasets provided by the Big Data Challenge that was organized by Telecom Italia []. We used the communication intensity data to build a directed network of urban areas in which the weights of the edges indicate the strength of communication between the areas. To estimate the happiness level of urban areas, we used Dodds et al.'s method [] to analyze the sentiment of geolocated short-message broadcasts (also known as tweets) initiated in these areas, which after aggregation (averaging) gave us a reliable approximation.
After preprocessing the data, we demonstrated the effectiveness of social media in mapping happiness at a much finer spatial resolution (within an urban area). Then, we investigated the relationship between communication among different geographic areas and their happiness levels. We found that communication patterns of urban areas exhibit homophilous behavior. More precisely, happy urban areas tend to interact with other happy areas more than they interact with unhappy areas. Similarly, unhappy urban areas tend to interact with other unhappy areas more than they interact with happy areas. The urban homophily in happiness that we witnessed in our dataset supports previous findings on homophily in happiness among individual humans [], and shows that this phenomenon persists at larger scales. Our result is relevant to policy-makers to guide them in setting strategies that increase happiness, which is itself correlated with important outcomes ranging from crime and health, to productivity and innovation.

The dataset
This work uses four datasets released by Telecom Italia for the Big Data Challenge  []. The datasets were collected during November and December . Among the released datasets, the following four were used.
• 'Milano Grid'. The city of Milan was divided into a spatial grid of  ×  cells. This dataset contains the ID of each cell in the grid along with the geometry of the cell. We will use the terms cell and area interchangeably.
• over -minute intervals. Figure  shows the distribution of outgoing calls and Internet traffic over the cells. One can observe that their distributions are partly characterized by a power law. However, the tail of the distribution exhibits an exponential cutoff, likely caused by cognitive saturation in the communication capacity of people within individual cells []. Similar distributions can be found for the incoming calls and the incoming/outgoing SMSs (check Section E of Additional file ).
• 'Telecommunications - MI to MI'. This dataset contains the directional interaction strength between the different cells in the grid. This is based on the calls exchanged between mobile phone users in Milan between Nov st and Dec st, . We used this dataset to construct the weighted directed network of interaction among cells.
• 'Geo Tweets'. This contains about , tweets that are geo-located in Milan. Figure  shows the distribution of the tweets in Milan, highlighting variability among different areas, and peaking at the center.

Preprocessing tweets
From the 'Geo Tweets' dataset, we extracted tweets written in Italian and English. These tweets constituted about % of the overall tweets (check Section H of Additional file ). Then, using the free Google Translate API (Goslate), we translated the Italian tweets to English.

Measuring happiness of cells
To measure the happiness in tweets and accordingly in cells, we used a total of , words with their happiness scores on a scale of  (unhappy) to  (happy). This data was used in various studies [, , ], and is available online []. Following the existing methodology [, ], we removed all words with a happiness score between  and , then the happiness score for each tweet was calculated depending on the words it contains. For a given tweet T, containing N unique words, the average happiness was calculated using

$$h_{\text{avg}}(T) = \frac{\sum_{i=1}^{N} h_{\text{avg}}(w_i)\, f_i}{\sum_{i=1}^{N} f_i},$$

where f_i is the frequency of the ith word w_i in T and h_avg(w_i) is the average happiness of the word w_i. Words that do not have happiness scores are given the value of zero happiness. Tweets with zero happiness scores were discarded because they do not provide any information about the sentiment of the area they belong to. Among the considered tweets, % of the tweets (K out of K) have zero happiness score. Figure (a) shows a histogram of the happiness scores of the tweets in our data.

We used the implementation of the Point Inclusion in Polygon Test by W. Randolph Franklin [] to map tweets to cells. Tweets that do not map into the grid (i.e. are not geo-located in Milan) were discarded. We also discarded cells with zero happiness scores. A cell has a zero happiness score either because it has no tweets (i.e. no tweets are mapped to it), or because all its tweets have zero happiness scores. Then, cells with fewer than ten unique Twitter users were discarded since they provide a very noisy measure of happiness. We are left with , cells, whose distribution of happiness scores is shown in Figure (b). A heat map of the cells' happiness scores in Milan is shown in Figure (a).
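The scoring step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names, the toy word scores, and the neutral band passed in as `neutral_band` are all hypothetical, and the sketch assumes that unscored words are simply left out of both sums, so that a tweet with no usable word ends up with a score of zero.

```python
from collections import Counter

def tweet_happiness(words, word_scores, neutral_band):
    """Frequency-weighted average happiness of one tweet.

    words: list of lowercase tokens in the tweet.
    word_scores: dict mapping word -> average happiness score.
    neutral_band: (low, high); words scoring strictly inside it are ignored.
    Returns 0.0 when no word in the tweet carries a usable score.
    """
    low, high = neutral_band
    counts = Counter(words)
    weighted_sum = 0.0
    total_freq = 0
    for word, freq in counts.items():
        score = word_scores.get(word)
        # Assumption: unscored and neutral words are skipped entirely.
        if score is None or low < score < high:
            continue
        weighted_sum += score * freq
        total_freq += freq
    return weighted_sum / total_freq if total_freq else 0.0

def cell_happiness(tweet_scores):
    """Average the non-zero tweet scores mapped to one cell (0.0 if none)."""
    kept = [s for s in tweet_scores if s != 0.0]
    return sum(kept) / len(kept) if kept else 0.0

# Toy lexicon and band: made-up values for illustration only.
toy_scores = {"love": 8.4, "rain": 5.0, "hate": 2.3}
toy_band = (4.0, 6.0)
example = tweet_happiness(["love", "love", "hate", "rain", "xyz"], toy_scores, toy_band)
```

In the toy call, "rain" falls inside the neutral band and "xyz" has no score, so only "love" (twice) and "hate" (once) enter the weighted average.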
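Our investigation of homophily relies on a network of cells with discrete categories (happy and unhappy). So the first step was to classify each cell according to its happiness level: cells with the highest 15% of happiness scores were labeled as happy, cells with the lowest 15% of happiness scores were labeled as unhappy, and the remaining cells were labeled as neutral.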

Figure 3 Distribution of happiness. (a) Histogram of a tweet's happiness. The histogram shows the distribution of happiness score per tweet. Tweets with zero happiness score were discarded. Similar to the distribution of average happiness in US cities found in [23], one can see that more tweets have happiness score above the average. (b) Histogram of a cell's happiness. The histogram shows the distribution of happiness score per area. Areas with zero happiness score and areas containing fewer than ten unique users were discarded. It is interesting to note that although the distribution of individual tweets' happiness is negatively skewed, the distribution of areas' happiness (i.e. aggregate happiness of these tweets) looks symmetric.
We removed all neutral cells since we are only interested in interaction among happy/unhappy cells. A heat map of happy, unhappy, and neutral cells is shown in Figure . The results presented in the paper are generated using the  percentile. However, we also tried other percentiles (, ,  and ) to label cells as happy or unhappy and reported the detailed results in Section A of Additional file .
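The labeling of cells into happy, unhappy, and neutral groups can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `label_cells` and the `percent` parameter are hypothetical, the group size is rounded down, and ties at the boundary are broken by sort order, which the source does not specify.

```python
def label_cells(cell_scores, percent=15):
    """Label the top and bottom `percent` of cells by happiness score.

    cell_scores: dict mapping cell_id -> happiness score.
    Returns a dict mapping cell_id -> "happy", "unhappy", or "neutral".
    """
    ranked = sorted(cell_scores, key=cell_scores.get)  # ascending by score
    k = len(ranked) * percent // 100  # cells per extreme group, rounded down
    labels = {cell: "neutral" for cell in ranked}
    for cell in ranked[:k]:
        labels[cell] = "unhappy"
    for cell in ranked[len(ranked) - k:]:
        labels[cell] = "happy"
    return labels

def drop_neutral(labels):
    """Keep only the cells labeled happy or unhappy."""
    return {cell: lab for cell, lab in labels.items() if lab != "neutral"}

# Toy data: 20 cells whose score equals their index.
toy_labels = label_cells({i: float(i) for i in range(20)})
```

Changing `percent` reproduces the kind of sensitivity check mentioned above, where other percentiles are used to define the two groups.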

Building communication network
To build the network, we used communication data for an entire workweek (from Nov th to Nov th, ) to characterize the urban communication network (similar results were obtained using communication data from a single day). We aggregated the calls' weights between cells during this week, then filtered out edges in which the calling/called cells were discarded in a previous stage (either because they had fewer than ten unique Twitter users or because they had zero happiness score). We also discarded self-edges that capture communication within cells. To remove the effect of transient communications, we used the weight of edges to filter out edges with weak connections (we discarded edges with aggregated weights less than .). Also, as observed in Figure , there is a variation in terms of the proportions of communication among areas. Therefore, the communication intensity between two areas can be attributed to their population [, ]. Unfortunately, the Big Data Challenge did not provide the population size in each area, and public population data is not available at the same level of granularity. Hence, as a proxy for population, we used the number of Twitter users who initiated tweets in each cell []. Our goal is to minimize the effect of population on the intensity of communication. Checking for correlation between the intensity of communication between two areas and the population in these areas would help to determine a good way to minimize this effect. We tested the correlation between the intensity of communication among a pair of areas with the minimum of the two areas' population, the product of the two areas' population, the average of the two areas' population, the calling area's population, and the receiving area's population. They all provided significant positive correlation values (Spearman's rank correlation: ., ., ., ., and . respectively). Thus, we divided the communication between each pair of areas by the lower population of the two:

Figure caption: The figure identifies the happiness category (happy, unhappy, or neutral) of each cell. We discretized the happiness score for areas. We considered the areas with the highest 15% happiness scores as happy areas, while the areas with the lowest 15% happiness scores as unhappy areas. Remaining areas are considered as neutral areas.

$$\frac{\text{Communication}(A, B)}{\min\{\text{population}(A),\ \text{population}(B)\}}.$$
The resulting network consists of  nodes which represent cells/areas, each of which is labeled as happy or unhappy. These nodes are connected using , weighted, directed edges that represent the intensity of calls between areas.
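The network construction steps of this section can be sketched as a single pass over the call records. This is an illustrative sketch under stated assumptions: the record layout `(source_cell, target_cell, strength)`, the function name `build_network`, and the `min_weight` parameter are hypothetical, and the sketch assumes the weak-edge threshold is applied to the aggregated weight before normalizing by population.

```python
from collections import defaultdict

def build_network(call_records, labels, population, min_weight):
    """Build a weighted, directed network of cells.

    call_records: iterable of (source_cell, target_cell, strength).
    labels: dict cell -> "happy"/"unhappy" for retained cells only.
    population: dict cell -> proxy population (e.g. unique Twitter users).
    min_weight: aggregated weights below this value are dropped.
    Returns a dict mapping (source, target) -> normalized weight.
    """
    totals = defaultdict(float)
    for src, dst, strength in call_records:
        if src == dst:
            continue  # drop self-edges (communication within a cell)
        if src not in labels or dst not in labels:
            continue  # drop edges touching discarded cells
        totals[(src, dst)] += strength  # aggregate over the whole period

    network = {}
    for (src, dst), weight in totals.items():
        if weight < min_weight:
            continue  # drop weak (transient) connections
        # Normalize by the smaller of the two proxy populations.
        network[(src, dst)] = weight / min(population[src], population[dst])
    return network

# Toy records: made-up values for illustration only.
toy_records = [("a", "b", 2.0), ("a", "b", 3.0), ("b", "a", 0.5),
               ("a", "a", 9.0), ("a", "c", 4.0)]
toy_network = build_network(toy_records, {"a": "happy", "b": "unhappy"},
                            {"a": 10, "b": 20}, min_weight=1.0)
```

Because cells with fewer than ten unique Twitter users were discarded earlier, the proxy population in the denominator is never zero for a retained cell.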

Results
We start with a visual exploration of whether cells communicate preferentially with cells of their own type. We ran a community detection algorithm (the multi-level modularity optimization algorithm []) on the communication network among the urban areas that we classified as 'happy' and 'unhappy'. The output is shown in Figure . Nodes labeled with + represent happy areas and nodes labeled with - represent unhappy areas. The different colors represent different communities identified by the algorithm. Most communities are dominated (≥ 2/3) by a particular class (happy or unhappy). To further quantify this effect, we statistically investigated the variation in communication between areas of different happiness levels. We conducted a two-way Analysis of Variance (ANOVA) to compare the effect of levels of happiness (i.e. happy versus unhappy) of the source and target areas on the strength of communication, a relationship that can be described by the following linear model:

$$\text{Communication} = \beta_0 + \beta_1\,\text{Source} + \beta_2\,\text{Target} + \beta_3\,(\text{Source} \times \text{Target}) + \varepsilon.$$

The dependent variable is the aggregate directional communication between the two areas (continuous number). The independent variables are: (1) Source, a factor with two levels of happiness representing the area that initiates the communication, (2) Target, a factor with two levels of happiness representing the area that receives the communication (happy or unhappy), (3) the interaction between these two factors (Source × Target).

Figure 5 Network of calls between happy/unhappy areas in Milan. Nodes represent areas, and edges represent calls between these areas. Nodes labeled with + represent happy areas, while nodes labeled with - represent unhappy areas. Colors represent communities generated by the multi-level modularity optimization algorithm [32]. One can see that most communities are dominated (≥ 2/3) by a particular class (happy or unhappy).
The results of the ANOVA show that the interaction between Source and Target is significant (F(, ,) = ., p .). The interaction effect might indicate the existence of homophily or heterophily in the communication patterns of urban areas in Milan. Homophily means that areas of the same level of happiness tend to interact more with each other than they interact with areas of the other level of happiness, whereas heterophily means that areas tend to interact with areas of the other level of happiness more than they do with areas of the same level.
We produced an interaction plot to visualize the interaction effect. Figure  shows a tendency for homophily in communication. Taking into account the whole volume of communication, happy areas tend to call happy areas more than they call unhappy areas. Similarly, unhappy areas tend to call unhappy areas more than they call happy areas. The same behavior can be also noticed regarding receiving the calls. Happy areas receive more calls from happy areas than from unhappy areas, and unhappy areas receive more calls from unhappy areas than from happy areas.
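The quantities behind such an interaction plot are the mean communication weights of the four source/target combinations. The sketch below computes them, together with a simple difference-of-differences contrast; it is an illustration with hypothetical names (`group_means`, `interaction_contrast`) and toy data, and it is not a significance test and not a substitute for the ANOVA reported above.

```python
from collections import defaultdict
from statistics import mean

def group_means(network, labels):
    """Mean edge weight for each (source label, target label) combination.

    network: dict mapping (source, target) -> weight.
    labels: dict mapping cell -> "happy" or "unhappy".
    """
    groups = defaultdict(list)
    for (src, dst), weight in network.items():
        groups[(labels[src], labels[dst])].append(weight)
    return {pair: mean(weights) for pair, weights in groups.items()}

def interaction_contrast(means):
    """Same-type means minus cross-type means; positive suggests homophily."""
    return (means[("happy", "happy")] + means[("unhappy", "unhappy")]
            - means[("happy", "unhappy")] - means[("unhappy", "happy")])

# Toy network: made-up weights for illustration only.
toy_labels = {"a": "happy", "b": "happy", "c": "unhappy", "d": "unhappy"}
toy_network = {("a", "b"): 4.0, ("b", "a"): 2.0, ("a", "c"): 1.0,
               ("c", "a"): 1.0, ("c", "d"): 3.0, ("d", "c"): 5.0}
toy_means = group_means(toy_network, toy_labels)
toy_contrast = interaction_contrast(toy_means)
```

Plotting the four means with the source label on the horizontal axis and one line per target label gives the crossing-lines pattern that an interaction effect produces.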
In order to study the significance of the previous observations, we conducted post hoc comparisons using the Tukey HSD test [] to compare all six possible pairs of interaction groups. Tukey HSD is a statistical test that is used with an ANOVA (a two-way ANOVA in our case) to do pairwise comparisons between the means of the groups (there are four source/target combinations, listed in Table , and thus we have six pairwise comparisons). Table  shows the mean and the standard deviation values for the interactions between happy/unhappy cells. The mean of the communication weights of the unhappy areas calling and receiving calls to/from other unhappy areas is statistically significantly different from the mean communication weights of the unhappy areas communicating with happy areas (p .). Similarly, the mean of the communication weights of the happy areas calling and receiving calls to/from other happy areas is statistically significantly different from the mean communication weights of the happy areas communicating with unhappy areas (p .).
We have also quantified the level of assortative mixing in the network of areas by using a weighted version of the assortativity coefficient defined by []. For more information about how we implemented it, please refer to Section D of Additional file . We found that the assortativity coefficient is . which could be considered relatively high. Hence, this is further evidence of the assortative behavior in the communication patterns of urban areas in Milan.
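One common way to weight the assortativity coefficient for discrete categories is to replace edge counts with edge weights in the mixing matrix, so that e_ij is the fraction of total weight on edges going from type i to type j, and r = (Σ_i e_ii − Σ_i a_i b_i) / (1 − Σ_i a_i b_i) with a_i and b_i the row and column sums. The sketch below follows that formulation; it is an assumption about the general approach, and the authors' own implementation (Section D of the additional file) may differ in detail.

```python
def weighted_assortativity(network, labels):
    """Weighted assortativity coefficient for categorical node labels.

    network: dict mapping (source, target) -> positive weight.
    labels: dict mapping cell -> category label.
    Undefined (division by zero) if all weight lies within a single category.
    """
    total = sum(network.values())
    types = sorted(set(labels.values()))
    # Mixing matrix: fraction of total weight from type s to type t.
    e = {(s, t): 0.0 for s in types for t in types}
    for (src, dst), weight in network.items():
        e[(labels[src], labels[dst])] += weight / total
    trace = sum(e[(t, t)] for t in types)
    a = {t: sum(e[(t, u)] for u in types) for t in types}  # row sums
    b = {t: sum(e[(u, t)] for u in types) for t in types}  # column sums
    expected = sum(a[t] * b[t] for t in types)
    return (trace - expected) / (1.0 - expected)

# Toy check: edges only within each type give perfect assortativity.
toy_labels = {"a": "happy", "b": "happy", "c": "unhappy", "d": "unhappy"}
toy_r = weighted_assortativity({("a", "b"): 1.0, ("c", "d"): 1.0}, toy_labels)
```

A value near 1 means weight concentrates on same-type edges, a value near 0 means mixing is what the marginals alone would predict, and negative values indicate a preference for the other type.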

Homophily on community level
Given these findings, it would be interesting to know whether homophily exists on the community level. To investigate this, we used the output of the community detection algorithm (namely the multi-level modularity optimization algorithm []). This algorithm uses the notion of modularity, which is a quality measure for graph clustering proposed by Newman []. After we found communities, we studied the effect of community size on the average and the standard deviation of a community's happiness score. In general, if the standard deviation is small, then one might conclude the existence of homophily on the community level. Additionally, we are interested in finding whether the average and the standard deviation of a community's happiness score change as the size of the community (measured as the number of cells) changes. For comparison, we generated random communities of the same sizes as the detected communities. A random community of size h is formed by randomly assigning h cells to it. The process is repeated  times and the average value (of averages or of standard deviations) is calculated for the community. For more details, check Section G of Additional file . Figure (a) shows that small communities have slightly higher average happiness than a random community of the same size. As community size increases, its happiness score decreases to become less than that of a same-size random community. This suggests that smaller communities enjoy a higher level of happiness than larger ones. Figure (b) shows that detected communities have lower standard deviation than the random communities, which suggests some evidence for homophily within communities. Additionally, it shows that the standard deviation of happiness score increases as the community size increases. However, random communities show a similar pattern of increase in standard deviation. This suggests that the increase in standard deviation is only slightly due to a decrease in homophily. That is, homophily within a community is slightly influenced by the size of the community.

Figure caption: The detected communities have lower standard deviation than the random communities, which suggests some evidence for homophily. Moreover, as the size of community increases, the average happiness decreases and the standard deviation of happiness increases. However, random communities show a similar pattern of increase in standard deviation. This suggests that the increase in standard deviation is mostly not due to decrease in homophily. As the blue line shows, the standard deviation only slightly increases as the size of the community increases. Note that only cells within top/bottom 15% of happiness are considered. The use of other percentiles produces similar plots (check Section G of Additional file 1).
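The random-community baseline described in this section can be sketched as repeated sampling of h cells. This is a minimal sketch with assumptions: the function name `random_baseline` and its parameters are hypothetical, cells are drawn without replacement, and the population standard deviation is used because the source does not say which estimator was applied.

```python
import random
from statistics import mean, pstdev

def random_baseline(cell_scores, size, repeats, seed=0):
    """Average mean and standard deviation of happiness in random communities.

    cell_scores: dict mapping cell -> happiness score.
    size: number of cells h in each random community.
    repeats: number of random communities to draw.
    Returns (average of means, average of standard deviations).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    scores = list(cell_scores.values())
    means, stds = [], []
    for _ in range(repeats):
        sample = rng.sample(scores, size)  # h cells, without replacement
        means.append(mean(sample))
        stds.append(pstdev(sample))  # assumption: population std. deviation
    return mean(means), mean(stds)

# Toy data: made-up happiness scores for ten cells.
toy_scores = {i: 5.0 + 0.1 * i for i in range(10)}
toy_mean, toy_std = random_baseline(toy_scores, size=4, repeats=100)
```

Comparing a detected community's own mean and standard deviation with the values returned for the same `size` gives the kind of contrast between detected and random communities discussed above.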

Discussion
We have taken a first step towards understanding the interplay between communication and happiness in urban areas at a high resolution. We found evidence of assortative mixing (homophily) in communication between different urban areas based on their happiness level. We also found that the mean of happiness seems to vary with community size, where community is defined in terms of communication structure. Obtaining our main result required developing a data science pipeline that combines data from a variety of sources and conducts social media data scraping, translation, sentiment scoring, aggregation, geo-location, and statistical hypothesis testing. We believe this type of pipeline can be used beyond the scope of this particular paper. For example, our claims about homophilous communication are limited to a particular indicator, namely happiness measured through public social media production. It may be possible to apply the same technique to measure homophily based on other sentiment indicators that can be extracted from social media, such as consumer confidence, or political opinions.
Certainly, Twitter is not the only way to measure happiness in cities, and it may be possible to establish assortativity using other measures of happiness such as collecting self-reported happiness of a sample of people through surveys []. However, these measures are expensive, particularly at the high spatial resolution obtained by this study. Moreover, using social media posts like tweets provides a real-time indicator of happiness, and is therefore better suited for applications that require this information at higher temporal resolution.
The main limitation of the present study is that it involves a single city. This is caused by the limited availability of data. In the future, it would be necessary to conduct similar investigations for other cities to see if homophily holds consistently across a variety of urban centers. Even if the pattern does hold, it would be interesting to investigate whether different cities exhibit homophilous communication to different degrees.
Another opportunity for further work is to explore the role (if any) played by other urban indicators, such as income, in mediating our observations. It may be possible, for example, that our observed effect is more (or less) pronounced for areas with similar per-capita income.
We believe there are many opportunities for further exploration of the role of urban communication in urban well-being. An interesting, though challenging, experiment can involve running interventions aimed at manipulating the urban communication structure to see if a causal link can be made to urban well-being.

Additional material
Additional file 1: Supporting information.