News and the city: understanding online press consumption patterns through mobile data

Ever-increasing mobile connectivity affects every aspect of our daily lives, including how and when we keep ourselves informed and consult news media. By studying a DPI (deep packet inspection) dataset, provided by one of the major Chilean telecommunication companies, we investigate how different cohorts of the population of Santiago de Chile consume news media content through their smartphones. We find that some socio-demographic attributes are highly associated with specific news media consumption patterns. In particular, education and age play a significant role in shaping consumers' behaviour even in the digital context, in agreement with a large body of literature on off-line media distribution channels.


Introduction
The Internet, the World Wide Web and, more recently, the pervasiveness of mobile technologies have radically transformed the way individuals consume cultural content. One of the areas most affected by new forms of digital media is journalism: how newspapers are accessed and how news is consumed. The abundance and diversity of topics have been altered forever [1]. Gaining a better understanding of how different population groups use and benefit from on-line news services is now even more important. By leveraging mobile information, it is possible to map the variability of news consumption patterns of the population onto socio-demographic features such as education level, age, income, etc.
In this work, we explore the association between social inequality and news consumption in the context of digital news outlets. We focus on Chile, a country that in the past three decades has experienced very rapid economic growth along with a deep political transformation (from a dictatorship to a democracy). For exactly these reasons, the case of Chile is particularly interesting. It is currently among the countries with the greatest social disparity and yet, according to Newzoo's Global Mobile Market Report, its penetration of digital technologies remains the highest in Latin America. The expected effect is that easier access to digital news outlets should "democratize" news consumption. Indeed, it has been found that newspapers and magazines lost their elite character, diversifying to adapt to popular preferences throughout the 20th century, while books remain a marker of 'elite culture'; and while status is not an important determinant of magazine and newspaper reading, education and income are [2].
News consumption patterns have been increasingly studied in the last decade, especially now that digital media provide a huge variety of new platforms that facilitate, and at the same time complicate, the consumption of news. Access to news outlets from mobile phones, for example, is now more fragmented and shallower than with traditional media [3]. Technology also drives the consumption of content in terms of continuity of access ("anywhere, anytime") [4]. Previous literature [5] has also shown how individual choices with respect to news consumption can be influenced by the social status of the users. A strong and systematic association between status and newspaper readership has previously been established [6]. The influence of education, income and age on newspaper consumption and platform preferences has also been explored [7].
The role of education is particularly critical. Researchers have found education to be a strong predictor of media behavior [8]. It is also positively associated with general news exposure [9]. In some contexts, news consumption is more unequally distributed than income, with greater social inequality in online news consumption than in off-line news consumption [10]. Within the particular context of Chile, news media have already been studied to outline, for example, their ownership structures or their political bias, and to understand the effects of certain press manipulations by owners of the media outlets [11].
The contribution of this work is two-fold. First, we explore the association between socio-demographic attributes and news consumption in the Chilean context, in line with the literature just described (which usually bases its analyses on single-nation frameworks, such as the US or Argentina) and with our own previous work [11][12][13][14]. Second, we rely upon an unconventional and unique data source: mobile phone data. Traditionally, this topic has been investigated using surveys among various population groups ([15][16][17][18], to mention just a few). We believe that mobile technology can provide new insights that complement traditional methods of data collection. Indeed, many diverse research questions have been addressed using mobile communication data: to elaborate indices of the economic development of a region [19], to study traffic flows, urban human mobility and social mixing phenomena [20][21][22][23][24][25], as well as epidemic spreading on fine spatial and temporal scales [26]. Such a data source could add new insights in this research area as well, with the added benefit that these methods are less expensive, less time-consuming, and scale efficiently.
In the following sections we present the results of a study on users accessing on-line news media in the city of Santiago de Chile (SCL) through mobile devices, using DPI (deep packet inspection) data. The focus is on the time window that spans from the 6th of July 2016 to the 2nd of August 2016. In order to characterize each area by the socio-demographic information of its residents, we used the 2017 census data (see Sect. 2.1.1). We find that some socio-demographic attributes are highly associated with specific news media consumption patterns. In particular, education and age play a significant role in shaping consumers' behaviour even in the digital context, in agreement with a large body of literature on off-line media distribution channels.

Datasets and data pre-processing
In this section, we present a description of the data used and of the procedures that we followed for data cleaning and pre-processing.

Census data
To characterize Chile socio-economically, we used the official 2017 Chilean census, which surveyed 17,574,003 people (51.1% females). Geo-politically, from largest to smallest, Chile consists of "regions", the largest administrative division, followed by "comunas" (similar to counties in the United States), census districts, census zones, and finally blocks, the finest level of geographical granularity. The public dataset is available at the level of blocks. Here, we focus on the Metropolitan Region of Santiago de Chile (RM, for short). The RM consists of 52 "comunas" for a total of 7,112,808 individuals (51.3% females), accounting for about 40% of the total population of Chile. In this work we correlate census information with data from mobile web traffic records. Since minors (persons under 18 years of age) cannot sign mobile phone subscriptions, we decided to be conservative and work only with adults in the census, to have a better match between the sample populations of the mobile and census data sets. We ended up with 5,450,592 individuals in the census and 2,455,148 accesses by unique users in the general phone dataset.
The 2017 Census contains a great deal of information on several aspects of the socio-economic composition of the Chilean population. Its drawback is that it is only an abridged version of the usual census survey, since a more comprehensive one will be released in 2022. Given the particular structure of the questionnaire, we were able to select the census features manually. Indeed, the vast majority of the questions in the survey revolve around a few topics (e.g., on top of the basic question "Do you consider yourself as a member of a native minority?" there are additional questions to distinguish between the different native minorities). We thus decided to select our features only among the following basic socio-demographic information:
• Age;
• Escolaridad (years of formal education attained);
• Student status (whether an individual is still studying or not);
• Membership of a native population.
The census does not report individuals' income directly. Thus, information about the economic situation had to be inferred. To do so, we chose escolaridad as a proxy. In Chile, at least, there is a well-known strong correlation between formal education and income distribution and inequality [27,28]. Moreover, the variable escolaridad weighs heavily in the calculation of the Human Development Index (HDI), a widely used indicator of wellness and quality of life, which in turn can be used to get a first, qualitative measure of the soundness of our results (a map of the HDI distribution in Santiago at the level of municipalities can be seen in Fig. 1).
Census data can be aggregated at the different levels already mentioned: region, comuna, district, zone and block. At finer granularity, census information becomes less specific, to avoid identification and preserve the privacy of individuals. We have chosen the district level as a good trade-off between granularity, privacy preservation and availability of information (see Fig. 2). Finally, each value was expressed as a percentage, or as a mean, depending on the type of quantity, over the population of each district. Other than the provisos made above, the census dataset was not modified or preprocessed in any other way.
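The district-level aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used: the record layout (district id, age, years of schooling, student flag) and the function name `aggregate_by_district` are hypothetical, and the toy records are invented.

```python
from collections import defaultdict


def aggregate_by_district(records, min_age=18):
    """Summarise adult census records per district.

    `records` is an iterable of (district_id, age, schooling_years, is_student)
    tuples. This layout is hypothetical, not the official census schema.
    """
    groups = defaultdict(list)
    for district, age, schooling, is_student in records:
        if age >= min_age:  # keep adults only, as described in the text
            groups[district].append((age, schooling, is_student))

    summary = {}
    for district, rows in groups.items():
        n = len(rows)
        summary[district] = {
            "population": n,
            # Means for continuous quantities ...
            "mean_age": sum(r[0] for r in rows) / n,
            "mean_schooling": sum(r[1] for r in rows) / n,
            # ... and percentages for categorical ones.
            "pct_students": 100.0 * sum(1 for r in rows if r[2]) / n,
        }
    return summary


# Invented toy records for two districts.
toy = [
    ("D1", 30, 16, False),
    ("D1", 20, 14, True),
    ("D1", 15, 9, True),   # minor, excluded from the summary
    ("D2", 50, 10, False),
]
summary = aggregate_by_district(toy)
```

Expressing every feature as a mean or a percentage keeps districts of different sizes comparable, which matters once the features are standardized for clustering.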

Building a mobile web connections data set
We obtained a DPI (Deep Packet Inspection) data set from one of the largest companies in Chile in terms of mobile subscription market share (around 30%). This data set is a record of Internet connections from smartphones or any device that contains a SIM card and is subscribed to the telco network. It consists of almost a month (between the 6th of July 2016 and the 2nd of August 2016) of anonymized events. An event is defined as a connection of an individual device to an IP address through a cell tower (or antenna). In order to preserve privacy, information was aggregated by antenna and by hour, without any single-user information, making it virtually impossible to de-anonymize news consumers.
Our DPI data set includes the number of unique users that connected from an antenna to an IP address at a certain hour (00, 01, 02, ..., 23), as in the sample shown in Table 1. In that small sample, the first row of the raw dataset tells us that from antenna 00000000, on July 6, 2016, there were two unique (i.e. distinct) users visiting 200.12.26.117 at 11 in the morning.
The urban area of Santiago de Chile contains about 15,000 cell phone signal receivers, called antennas. Most outdoor antennas are placed on top of poles called cell phone towers. There are about 1700 towers in our area of interest, with an average of 10 antennas per tower. There are also standalone indoor antennas (sometimes called "small cells"), usually attached to the indoor walls of large spaces such as hospitals, malls and public buildings. We have information on the exact latitude and longitude of the different towers and indoor antennas, allowing us to geo-reference our analyses. Abusing terminology, we call every unique geo-located cell phone receiver an "antenna": this may be a tower with dozens of antennas on it, or just a single indoor antenna in a hospital. In high-demand areas, such as financial districts or industrial complexes, there are tens of towers with tens of antennas (plus indoor antennas) within a small area of a few tens of meters. Which antenna a user connects to depends on many factors, including distance, traffic load (many devices may already be connected to that antenna), power and azimuth, among others. If the user moves a few meters, the network may assign them a new antenna every few minutes. For that reason, to make the results more stable, in this paper we group together all the antennas within a 1.1 km radius, obtaining a grid of about 700 points. We call these "dummy" antennas (the red dots in Fig. 2). "Dummy" antennas are the result of clipping the full latitude and longitude to two decimals, so that instead of having the "real" antennas at high precision (and at a much finer level of granularity), we now have a "dummy" antenna at the center of each 1.1 km² square cell. These antennas should be taken as our "sensors", the finest level of granularity we have access to, which will also be mapped to the census, as we will see later.
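The construction of the "dummy" antennas can be illustrated with a short sketch. It assumes that "clipping" means truncating the coordinates toward zero at the second decimal (rounding would be an equally plausible reading), and the antenna identifiers and coordinates below are invented; `dummy_antenna` and `group_antennas` are hypothetical helper names.

```python
import math
from collections import defaultdict


def dummy_antenna(lat, lon, decimals=2):
    """Clip a coordinate pair to `decimals` decimal places.

    Assumption: clipping is implemented as truncation toward zero.
    """
    factor = 10 ** decimals
    return (math.trunc(lat * factor) / factor,
            math.trunc(lon * factor) / factor)


def group_antennas(antennas):
    """Map each clipped coordinate pair to the ids of the antennas it contains.

    `antennas` is an iterable of (antenna_id, lat, lon) tuples.
    """
    cells = defaultdict(list)
    for antenna_id, lat, lon in antennas:
        cells[dummy_antenna(lat, lon)].append(antenna_id)
    return dict(cells)


# Invented coordinates in the Santiago area.
cells = group_antennas([
    ("a", -33.4489, -70.6693),
    ("b", -33.4412, -70.6655),
    ("c", -33.4301, -70.6502),
])
```

Here antennas "a" and "b" fall into the same cell, so their hourly counts would be reported by a single "dummy" antenna, while "c" lands in a neighbouring cell.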
Our dataset is a part of the complete deep packet inspection dataset of the telco provider. This subset contains only requests to those IP addresses that belong to some kind of news media outlet. In order to navigate among these outlets, we used a curated list of news organizations analyzed in a previous work [11]. These are around 400 news outlet accounts, from which we retained twenty-six outlets whose economic and political bias we knew [12]. Identifying each IP address in order to associate it with the name of a website was not straightforward. There was often a many-to-one relationship between websites and IPs; for example, there are clusters of two or more websites that share the same IP. This is a critical issue, since in the dataset we can only see IP addresses, without any reverse DNS lookup. In any case, most of the websites that share the same IP belong to the same owner. This allows us to label each IP by its owner: in this way we lose knowledge about the individual news outlet, which we are often unable to identify, but we still keep a satisfactory amount of information by creating a unique matching between each IP and the editorial group (EG) that owns it. Thus, whenever needed, the EG was considered instead of the individual news medium. Also, as shown in [11], the power structure in Chilean news media is strongly biased towards very few groups that share an identifiable editorial line. The only newspaper outside the above list is The Clinic, a Chilean satirical newspaper that is usually identified as leftist, which we added in order to cover the political spectrum as widely as possible and to use it as a baseline for informational extremes ("El Mercurio" would be right-conservative, and "The Clinic" would be left-liberal). The complete list of news outlets examined is shown in Table 2.
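Labelling each IP address by its owner amounts to a simple lookup followed by an aggregation. The sketch below is only illustrative: the IP addresses are taken from ranges reserved for documentation, the group names are placeholders, and `users_by_group` is a hypothetical helper rather than part of the original analysis.

```python
from collections import Counter

# Hypothetical mapping; several websites (and IPs) may share one owner.
IP_TO_GROUP = {
    "192.0.2.10": "Group A",
    "192.0.2.11": "Group A",
    "198.51.100.7": "Group B",
}


def users_by_group(events, ip_to_group):
    """Sum user counts per editorial group.

    `events` is an iterable of (ip, n_users) pairs. IPs that are not in the
    curated mapping are ignored, mirroring the filtering to known outlets.
    """
    totals = Counter()
    for ip, n_users in events:
        group = ip_to_group.get(ip)
        if group is not None:
            totals[group] += n_users
    return totals


totals = users_by_group(
    [("192.0.2.10", 2), ("192.0.2.11", 3), ("198.51.100.7", 4), ("203.0.113.5", 9)],
    IP_TO_GROUP,
)
```

Note that summing hourly unique-user counts across IPs can count the same device more than once, so such totals are best read as a measure of activity rather than as a head count.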
This selection is useful for various reasons: it allows us to remove some noise caused by outliers, i.e. less popular news media that display a very low number of connections, and, most importantly, it helps us focus on a narrow but significant range of news outlets, for which we already have quantitative knowledge of their political bias and popularity. Incidentally, with this filtering we were able to keep in our analysis BioBioChile, El Mercurio and Cooperativa, the three most visited outlets in our dataset, which account for the vast majority of the volume of web traffic, thus not losing too much information.
After filtering by news outlet and considering only the urban area of Santiago de Chile, we have 4,313,964 events. Finally, since we are crossing mobile data with the census, which contains socio-demographic information about the residents of a certain area, we need to take into account the phenomenon of the floating population, i.e. people moving from one place to another, especially at commuting hours and on working days. This matter has been extensively studied, and there are case studies set in Santiago that show exactly how the city is affected by the phenomenon (and how mobile data can be used to understand urban mobility; see for example [29]). To tackle this issue, we examined the temporal patterns of the connections, illustrated in Fig. 3. By comparing the trends between the weekends (Saturdays and Sundays) and the working days (Mondays to Fridays) we can easily notice the circadian rhythms of the city: the number of connections rises when people wake up, stays high while they commute to work, rises again at lunch time and once more when they go back home at 6 pm. This last peak is what differentiates the working days from the weekends: on Saturdays and Sundays no afternoon peak can be observed, and the connections decline smoothly towards the night hours instead. The typical effects of the floating population phenomenon seem much attenuated during the weekends, and the volume of connections seems to remain almost constant throughout the whole day. The correlation between the number of unique users and the number of residents in each comuna is rather high, with a Pearson coefficient P = 0.75 (p-value = 1.85e-07). This means that the geographical distribution of users accessing news media websites from mobile devices is a good proxy for the geographical distribution of the population. There is a clear trade-off in our choice of considering only the connections at the weekends: on the one hand we reduce our data to almost one quarter of the initial volume, i.e. ∼1M events, but on the other hand we are able to address the issue of the floating population, at least partially reducing the noise caused by this phenomenon.
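Restricting the analysis to weekends is a one-line filter once each event carries its date. The sketch below assumes a hypothetical event layout of (date, hour, antenna id, unique users); the first toy row echoes the example row quoted earlier in the text, while the other two are invented.

```python
from datetime import date


def is_weekend(day):
    """Return True for Saturdays and Sundays (Monday is weekday 0)."""
    return day.weekday() >= 5


def weekend_events(events):
    """Keep only the events whose date (first field) falls on a weekend."""
    return [event for event in events if is_weekend(event[0])]


toy_events = [
    (date(2016, 7, 6), 11, "00000000", 2),   # Wednesday, dropped
    (date(2016, 7, 9), 18, "00000000", 5),   # Saturday, kept
    (date(2016, 7, 10), 9, "00000001", 3),   # Sunday, kept
]
kept = weekend_events(toy_events)
```

The filter trades data volume for a closer match between where connections are made and where people live, which is the assumption the census mapping relies on.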

Ethical considerations
Working with mobile data could raise privacy concerns. The DPI data were handed over by the mobile provider already anonymized and grouped per tower: the data appeared as shown in Table 1, displaying only the number of unique devices connected to a certain antenna, without any knowledge about the identity or the profile of the customer. Therefore, we work with highly aggregated data and, for good measure, the resolution was then made even coarser, as described in Sect. 2.1.2. The same applies to census data: even though they are already anonymized and publicly available, the research was not carried out at the finest possible granularity, in order to strike a good balance between the amount of available information and privacy preservation. Our concern in relation to privacy stems from the possibility of cross-referencing raw DPI data with census information, which could have unforeseen consequences and result in potential de-anonymization.
No attempt was made to infer statistics about individuals: the nature of DPI itself prevents this from happening. All the results are general trends of very large groups of residents, between hundreds of thousands and millions of people.

Mapping census data to districts
All the connections to the websites are grouped by hour and by antenna: this provides us with a good approximation of the users' position in the city. Our goal is to study the connections based on the census features of the areas where they originate. Thus, we want to assign to each census district (CD) a label that helps us distinguish between different segments of the census. To estimate the most appropriate number of groups (and, therefore, labels), we cluster the CDs based on the census features that we selected earlier. We use a k-means algorithm, running the procedure several times for 2 ≤ k ≤ 30. Since we need a clear and easily interpretable grouping, we limit our choice to 3 ≤ k ≤ 8: having only 2 census groups would not be enough for such an analysis, while more than 8 would be hard to interpret. The values are standardized, and the algorithm was initialized with a random seed. For each k we compute the Gini coefficient of the distribution of the population across clusters (Fig. 4): a low Gini coefficient represents a situation of order, in which the population is almost equally distributed across all the clusters, while a high value indicates a heterogeneous distribution across the different groups.
The final value, k = 5, is chosen because it maximizes the Gini coefficient and, thus, the heterogeneity of the distribution of the population. This is based on the assumption that a very ordered situation, in which we have almost the same number of people in each cluster, would be unrealistic.
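The selection rule can be made concrete with a small sketch: given the population of each cluster for every candidate k, compute the Gini coefficient of those populations and keep the k with the largest value. The cluster sizes below are invented for illustration (they are not the values behind Fig. 4), the clustering step itself is omitted, and `gini` uses the standard formula for sorted, non-negative values.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # With ranks i = 1..n over the sorted values:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Hypothetical cluster populations for three candidate values of k.
sizes_by_k = {
    3: [330, 330, 340],
    4: [250, 250, 250, 250],
    5: [50, 100, 150, 300, 400],
}
gini_by_k = {k: gini(sizes) for k, sizes in sizes_by_k.items()}
best_k = max(gini_by_k, key=gini_by_k.get)
```

In this toy example the five-cluster partition is the most uneven and is therefore the one selected.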
By setting k = 5 and inspecting the average values of the features for each group (see Table 3), we observe an almost hierarchical relationship among the different clusters. This is particularly true for the escolaridad variable, which correlates very well with income, as mentioned above. The clusters were simply named K1, K2, K3, K4 and K5, with K1 being the wealthiest area and the others following in ordinal fashion. In light of this hierarchical relationship, we will also refer to these groups (or clusters) as census levels.
The resulting map, shown in Fig. 5, resembles the distribution of the HDI shown in Fig. 1. In particular, the clustering procedure captures very well the presence of a very rich and highly segregated area (cluster K1) in the north-eastern part of the urban area, while the other census groups are distributed more heterogeneously throughout the city.

Study of the global spatial autocorrelation
As further evidence of the segregation that emerges from the k-means algorithm, we dig deeper into the spatial distribution of the census features. To do this, we measure Moran's Index [30] on census data and test the global spatial autocorrelation of the features against a random null model. Moran's I for a variable $y$ measured over $n$ spatial units is defined as

$$I = \frac{n}{S_0} \, \frac{\sum_i \sum_j w_{ij} z_i z_j}{\sum_i z_i^2},$$

where $w_{ij}$ are the elements of a spatial weight matrix $W$, $S_0 = \sum_i \sum_j w_{ij}$ and $z_i = y_i - \bar{y}$ is the deviation of $y$ from its mean value in the spatial unit $i$. The matrix $W$ is essential in spatial autocorrelation analysis, since it provides the model with a measure of spatial contiguity. The definition of spatial contiguity is usually specified as a neighborhood relationship between spatial units. Since we are dealing with census districts, namely areas of the city, we decided to use the Queen neighborhood (Fig. 6). We believe that using a Rook neighborhood in a context of urban spatial analysis would be too limiting because, for each computation, we would not be considering the influence of the areas at the corners of the spatial unit under examination. Thus, we defined as contiguous all those areas that surround the zone we are observing, considering it sufficient that they meet only at corner vertices [31]: all those areas interact, communicate and most likely influence one another.
We measured the global spatial autocorrelation for the census since we need to corroborate any evidence of clustering resulting from the application of the k-means algorithm. Moran's I is a univariate measure of correlation, hence we measured it for every census feature. All the results were then compared with the Moran's I calculated for the case of a completely random spatial distribution of the features.
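The following sketch shows how Moran's I with Queen contiguity can be computed and compared against a random spatial arrangement. It rests on two assumptions that are not taken from the study: the census districts (irregular polygons in reality) are replaced by a regular grid, and the random null model is realised as a permutation test that reshuffles the values over the cells.

```python
import random


def queen_weights(n_rows, n_cols):
    """Binary Queen-contiguity neighbour lists for a regular grid.

    Two cells are neighbours if they share an edge or a corner.
    """
    neighbours = {}
    for r in range(n_rows):
        for c in range(n_cols):
            neighbours[r * n_cols + c] = [
                (r + dr) * n_cols + (c + dc)
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
                and 0 <= r + dr < n_rows
                and 0 <= c + dc < n_cols
            ]
    return neighbours


def morans_i(y, neighbours):
    """Global Moran's I with binary weights: (n / S0) * sum(w z_i z_j) / sum(z_i^2)."""
    n = len(y)
    mean = sum(y) / n
    z = [value - mean for value in y]
    s0 = sum(len(js) for js in neighbours.values())
    cross = sum(z[i] * z[j] for i, js in neighbours.items() for j in js)
    return (n / s0) * cross / sum(v * v for v in z)


def permutation_p_value(y, neighbours, n_perm=999, seed=0):
    """Share of random reshufflings whose I is at least the observed one."""
    rng = random.Random(seed)
    observed = morans_i(y, neighbours)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, neighbours) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# A fully segregated toy pattern on a 4 x 4 grid: left half 0, right half 1.
w = queen_weights(4, 4)
segregated = [0, 0, 1, 1] * 4
i_segregated = morans_i(segregated, w)
p_segregated = permutation_p_value(segregated, w)
```

A clearly positive I together with a small permutation p-value indicates that similar values cluster in space far more than a random arrangement would produce.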
In order to have a better understanding of the spatial autocorrelation and have a more local view, we examined the Moran scatterplot of the data points for each feature. The analysis of the Moran scatterplot was first introduced by Anselin [32], and it is based on the interpretation of Moran's Index as the coefficient of an ordinary least squares (OLS) regression of the spatially lagged variable on the variable itself. For a row-standardized $W$ (so that $S_0 = n$), the index can be written as

$$I = \frac{\sum_i z_i \sum_j w_{ij} z_j}{\sum_i z_i^2},$$

where $\sum_j w_{ij} z_j$ is the spatial lag of the variable $z$ (which in turn is $y$ expressed as deviation from the mean). Hence the spatial lag is a weighted average of the variable $z$ over the neighbours of $i$. Note that the interpretation of $I$ as a coefficient of an OLS regression is valid for any statistic that can be expressed as a ratio between a quadratic form and the sum of the squares [32], which is exactly how Moran's I is defined.
Plotting the spatial lag of z as a function of z allows us to have a local view of the spatial autocorrelation. The plot is divided into four quadrants, going from the 1st quadrant of high values of both z and the spatial lag (high-high correlation) to the 4th quadrant of high-low correlation, passing through the so-called low-high and low-low correlation. A point far up in the first quadrant of the Moran scatterplot of the average schooling years feature represents an area that has a very high mean value of schooling and is surrounded by neighbours with similarly high values.
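The coordinates of a Moran scatterplot and the quadrant of each point can be computed as in the sketch below. It assumes row-standardized weights (the spatial lag is the plain average of the neighbours' deviations) and uses four invented spatial units arranged in a line; the function names are hypothetical.

```python
def spatial_lag(y, neighbours):
    """Return (z, lag): deviations from the mean and their neighbour averages."""
    mean = sum(y) / len(y)
    z = [value - mean for value in y]
    lag = [sum(z[j] for j in neighbours[i]) / len(neighbours[i])
           for i in range(len(y))]
    return z, lag


def quadrant(z_i, lag_i):
    """Label a scatterplot point by the sign of its value and of its lag."""
    if z_i >= 0:
        return "high-high" if lag_i >= 0 else "high-low"
    return "low-high" if lag_i >= 0 else "low-low"


def ols_slope(z, lag):
    """Slope of the regression of the lag on z; equals Moran's I here."""
    return sum(a * b for a, b in zip(z, lag)) / sum(a * a for a in z)


# Four units in a row; each unit neighbours the adjacent ones.
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

z_seg, lag_seg = spatial_lag([10, 9, 2, 1], line)   # similar values adjacent
labels_seg = [quadrant(a, b) for a, b in zip(z_seg, lag_seg)]

z_mix, lag_mix = spatial_lag([10, 1, 9, 2], line)   # alternating values
labels_mix = [quadrant(a, b) for a, b in zip(z_mix, lag_mix)]
```

In the first arrangement all points fall in the high-high and low-low quadrants and the slope is positive, which is the signature of segregation; in the alternating arrangement the points move to the high-low and low-high quadrants.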
As we can see in Fig. 7, there is strong evidence of segregation for almost all the census features, in particular for the extreme census segments K1 and K5 which, depending on the feature, are usually located way up near the bisector of the first and third quadrants. The census segments in between, instead, display a far more mixed situation.

Geo-referenced analysis of the DPI dataset
The labeling of the census zones based on socio-demographic features finally enables us to proceed with a geo-referenced analysis of the connections to the selected news outlets' websites. The goal is to find differences in the consumption of news media content between areas of the city that have different socio-demographic attributes.
A list of the news outlets can be found in Table 2: wherever we encountered the issues described in Sect. 2.1.2, we grouped the news outlets by owner and considered the resulting group as an individual entity. This is the case for all the outlets belonging to the El Mercurio editorial group, which we included in our final list as a single entry.
In Fig. 9 we plot, for each News Outlet (NO), the number of unique users in each CD, normalized over the number of residents of the cluster the CD belongs to. Figure 10 shows the same quantity considering all the NOs together, i.e. the total number of users connecting to news outlets' websites during the examined weekends of July and August 2016.
Before analyzing these plots, it is worth going back to Table 3. The grouping into 5 clusters portrays a very clear division between the first two census segments and the remaining three. Indeed, groups K1 and K2 display a very high average value of schooling years (> 16), with group K2 being composed of much younger individuals and a higher percentage of students. Groups K3, K4 and K5 are instead more comparable with each other, displaying a progressively lower value of schooling years and percentage of students, and a higher share of people of indigenous ethnicity. As pointed out in Sect. 2.3.1, these three groups are spatially mixed, while K1 and K2 are segregated in the north-eastern and, to a lesser extent, in the central part of the urban area of the Region Metropolitana de Santiago.

Figure 7 Moran scatterplot for the census features. For all the features except the mean age, the census segments (K1 and K5 in particular) are strongly segregated. We observe different patterns for the mean age, where K2 is shifted to the left, towards a lower average age.
In light of this grouping, we can inspect the results shown in the plots. From Fig. 10 we can gather the first clear piece of evidence: group K2 completely dominates the chart. This means that, once normalized, most of the activity originates in those areas that are characterized by a high percentage of young and highly educated residents. As said, for each cluster the connections have been normalized over the number of people in it, meaning that we are looking at a measure of the activity of a cluster with respect to its size. This dominance of group K2 and, to a lesser degree, of K1 is also confirmed by computing the Pearson correlation between the number of connections and the number of residents in each cluster. We get P = -0.92 (p-value = 0.029), a very strong anti-correlation, meaning that, in proportion to their size, the smallest clusters show the highest activity. Indeed, the smallest clusters are K2 (with only 255,699 people) and K1 (682,053 inhabitants), while the remaining three have more than 1M residents each. We can also analyze the same trends for the individual news media websites, shown in Fig. 8 and disaggregated by cluster in Fig. 9. From Fig. 9 we can see that the areas belonging to group K2 are again on top in the majority of the plots. The only exception is the websites belonging to El Mercurio, a right-wing editorial group, which are most accessed from areas belonging to the K1 segment. Similar numbers for K1 and K2 can also be found for the website of Diario Financiero, a financial newspaper. For the other census groups no clear pattern can be identified, but it still seems clear that, in both Figs. 9 and 10, group K3 is always the least active. In general, segments K3, K4 and K5 display an activity comparable to that of groups K1 and K2 only for those news media, like Biobio, Cooperativa or Tele13, that can be classified as containers of generic news and information, unlike Diario Financiero, which is highly specialised in financial matters, or El Mercurio, which is openly politically oriented.
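The normalization and the correlation check discussed above can be reproduced on toy numbers. The sketch below takes one possible reading of the procedure, correlating the per-resident activity of each cluster with its number of residents; all figures are invented and are not the values of the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical residents and unique users per cluster (arbitrary units).
residents = {"K1": 700, "K2": 250, "K3": 1500, "K4": 1200, "K5": 1100}
users = {"K1": 140, "K2": 100, "K3": 75, "K4": 120, "K5": 110}

# Activity of a cluster relative to its size.
activity = {k: users[k] / residents[k] for k in residents}

clusters = sorted(residents)
r = pearson([residents[k] for k in clusters], [activity[k] for k in clusters])
most_active = max(activity, key=activity.get)
```

With only five clusters, any such coefficient rests on five data points, so it is best read as a summary of the ranking rather than as a precise estimate.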
As expected [24, 33, 34], the circadian rhythm of the population can still be seen in these figures, although to a lesser extent than in the same analysis performed on the weekdays (Fig. 3), as explained in Sect. 2.1.2. The decline in connections towards night hours is smooth, and the trends are very similar across all the census segments.

Addressing the limitations
This method presents some limitations. First of all, we acknowledge a possible bias in our dataset: while the census is representative of the whole population, mobile data account only for the customers of the service provider, which can be biased towards a certain segment of the population depending on multiple factors (types of contracts offered, marketing strategies, etc.). Nonetheless, bearing in mind that this provider alone owns more than 37% of the market share, a correlation check between our dataset and the number of residents in each census zone gave us satisfactory results, as pointed out in Sect. 2.1.2. Another influence on the results could come from the different penetration rates of WiFi technologies among the various census groups. One of our basic assumptions is that people, during weekends, access the web mostly from their houses or from nearby locations. This means that, especially in their houses but also in some public places, people could be surfing the web via WiFi instead of using their own mobile data. This, again, strongly depends on the specific country and on its telco market, and it could also be related to the fact that, in general, WiFi is more popular among those individuals who have, for example, a bank account (sometimes needed even just to get a broadband contract in the first place) or at least a certain yearly income. Thus, we acknowledge that these census groups, in particular the K1 and K2 segments according to our clustering algorithm, could be slightly under-represented in our dataset. Nonetheless, the positive correlation between the number of unique accesses and the residents in each municipality is reassuring evidence in this respect. Moreover, for the particular case of Chile, the effects of this issue should be limited: according to official data (Subtel), nearly 80% of all Internet access is made from mobile phones [35].
A further limitation of this study is data availability. We are under strict non-publication policies from the telco, and the data cannot leave the telco's premises. This is to be expected given the novelty of the data sets we are working with, and their yet-to-be-understood privacy implications (this notwithstanding, please see the "Data Availability" section below).

Conclusions
The main purpose of this work is to understand whether individuals living in different areas of the city of Santiago de Chile display different behaviors in terms of access to digital news media content with respect to socio-demographic attributes. In order to do so, we studied a record of a month (July-August 2016) of anonymized and geo-referenced accesses to several websites of Chilean news media through the cellphone network of a major telco provider. To reduce the effects of the floating population phenomenon, we analyzed only a portion of the connections, narrowing the window to the weekends only.
By applying a k-means algorithm to official Chilean census data, in order to group the population into 5 different census levels, we find that the wealthiest areas of the city, K1 and K2, are very similar in terms of years of schooling, with K2 members being, on average, much younger than K1 members. The other three clusters (K3, K4 and K5) are considerably further from the first two in terms of average education level, and far more spatially mixed, whereas K1 and K2 are mostly grouped in the north-eastern quadrants of the city.
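The clustering step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the two features (years of schooling, mean age), the toy districts, and the deterministic farthest-point initialisation are all choices made for this example, and feature standardisation is omitted for brevity.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def init_centroids(points, k):
    """Deterministic farthest-point initialisation (an assumption of this sketch)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point whose nearest chosen centroid is farthest away
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return labels, centroids


# Hypothetical districts as (years of schooling, mean age); NOT census values
districts = [
    (15.0, 45.0), (15.5, 47.0), (14.8, 44.0),
    (9.0, 30.0), (9.5, 31.0), (8.8, 29.0),
]
labels, centroids = kmeans(districts, k=2)
print(labels)
```

In practice the features would usually be rescaled before clustering, so that a variable with a large numeric range, such as age, does not dominate the distance.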
The news access patterns tell us that segment K2 is overall the most active group. This means that the most highly educated and youngest are also the most informed, compared with older people with a similar education level. If we break down the number of accesses by individual media outlet, we obtain a more composite picture (Fig. 3). K2 remains the most comprehensively informed with respect to the diversity of media outlets; on the other hand, the ranking of the other groups can vary depending on the flavour of the news media.
These findings confirm empirically that socio-demographic features have an influence on the consumption of news. Our results suggest that the consumption of news media content does not necessarily increase with the education level of the user. While it is true that the highly educated are those who access news media websites the most, the opposite cannot be inferred: indeed group K3, which includes all those areas of the city whose residents have an average education level (see Table 3), is the cluster that displays the lowest activity.
The predominance of group K2 also tells us that age plays a key role in this context. Young generations are likely more comfortable using mobile devices in their daily lives, and this has also been explored in the specific field of news consumption [36].
In summary, although we found clear correlations between the consumption of online news media content and the socio-demographics of users, this relationship seems to be non-trivial. While the highly educated are the most eager users of mobile news media outlets, there are important clues that a group of well-educated people, whom we could identify as middle-class, shows a lack of interest in accessing news media via mobile. This could have a significant impact on the civic engagement of a relevant share of the population, as well as on the economic and political life of a country, not to mention the media market itself.
There are positive correlations between the other two features (student status and percentage of people of native ethnicity). Therefore, a more detailed analysis of the ties between these features and the consumption patterns of the individual news outlets, namely the habits and preferences of students and native minorities, could be of much interest in the context of studying inequalities in access between different cohorts of the population, and it is left as future work.
We believe that this work could be useful to shed some light on how digital platforms can contribute to the already complex interplay between socio-demographic characteristics of the population and news consumption behaviour. Additional studies with more comprehensive datasets in other social contexts are needed to assess the full impact of digital technologies on this topic, as well as to understand whether these findings are highly country-specific or whether there is a shared worldwide trend.

Figure 1 HDI distribution for the urban area of the Region Metropolitana de Santiago

Figure 2 Census districts, comunas and dummy towers in the urban area of Santiago. The census data is available at the level of census districts, the small areas surrounded by white borders. The figure also shows the boundaries of the comunas (black lines), the administrative areas into which the city is divided. The red dots represent the dummy towers, obtained by clipping the coordinates of each antenna to the second decimal digit. All the traffic flow outgoing from each antenna was aggregated in the dummy tower to which it belongs
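The dummy-tower aggregation described in this caption can be sketched as follows. This is only an illustration under assumptions: "clipping" is interpreted here as truncation toward zero at the second decimal digit (rounding would be an equally plausible reading), and the function names and antenna records are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_DOWN


def dummy_tower(lat, lon):
    """Clip a coordinate pair to two decimal digits (truncation is assumed)."""
    step = Decimal("0.01")
    # Going through str() avoids binary floating-point artifacts when truncating
    clipped_lat = Decimal(str(lat)).quantize(step, rounding=ROUND_DOWN)
    clipped_lon = Decimal(str(lon)).quantize(step, rounding=ROUND_DOWN)
    return float(clipped_lat), float(clipped_lon)


def aggregate_traffic(records):
    """Sum the traffic of all antennas that fall on the same dummy tower."""
    totals = defaultdict(int)
    for lat, lon, connections in records:
        totals[dummy_tower(lat, lon)] += connections
    return dict(totals)


# Hypothetical antenna records: (latitude, longitude, number of connections)
antennas = [
    (-33.4512, -70.6693, 120),
    (-33.4587, -70.6621, 80),
    (-33.4391, -70.6502, 45),
]
towers = aggregate_traffic(antennas)
print(towers)
```

With these toy records, the first two antennas collapse onto the same dummy tower, so their connections are summed, while the third remains on its own.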

Figure 3 Comparison between connections on weekends and working days. Values are normalized over the population of the area from which the connection is made. The dashed lines correspond to commuting hours: it is evident that during the weekends the effects of the floating population are strongly attenuated

Figure 4 Gini coefficient for different values of K. K = 5 was chosen as it maximizes the coefficient. The green area refers to the values of K that we took into consideration for the final choice

Figure 5 Figure 6
Figure 5 Result of the k-means clustering on the census districts

Figure 8 Unique users over the whole time window studied, detailed by hour and by news outlet. On the horizontal axis we have the time of the day, and on the vertical axis the normalized accesses. The circadian rhythm followed by the temporal patterns of most of the news outlets clearly emerges from these plots

Figure 9 Unique users over the whole time window studied, detailed by hour, by news outlet and by cluster. On the horizontal axis we have the time of the day, and on the vertical axis the normalized number of users. These plots show us how the different clusters contribute to the web traffic of the individual news outlets

Figure 10 Total connections, detailed per hour and disaggregated by census segment of the area of origin

Table 1
The first lines of the DPI dataset

Table 2
List of news outlets (or editorial groups) examined in the DPI data set

Table 3
Insights into the clusters. This table shows the mean values of the features for each cluster