Touristic site attractiveness seen through Twitter

Tourism is becoming a significant contributor to medium- and long-range travel in an increasingly globalized world. Leisure travel has an important impact on the local and global economy as well as on the environment. The study of touristic trips is thus attracting considerable interest. In this work, we apply a method to assess the attractiveness of 20 of the most popular touristic sites worldwide using geolocated tweets as a proxy for human mobility. We first rank the touristic sites based on the spatial distribution of the visitors' places of residence. The Taj Mahal, the Pisa Tower and the Eiffel Tower appear consistently in the top 5 in these rankings. We then move to a coarser scale and classify the travelers by country of residence. The touristic sites' visiting figures are then studied by country of residence, showing that the Eiffel Tower, Times Square and the London Tower welcome the majority of the visitors of each country. Finally, we build a network linking sites whenever a user has been detected in more than one site. This allows us to unveil relations between touristic sites and find which ones are more tightly interconnected.


INTRODUCTION
Traveling is getting more accessible in the present era of progressive globalization. It has never been easier to travel, resulting in a significant increase of the volume of leisure trips and tourists around the world (see, for instance, the statistics of the last UNWTO reports [1]). Over the last fifty years, this increasing importance of the economic, social and environmental impact of tourism on a region and its residents has led to a considerable number of studies in the so-called geography of tourism [2]. In particular, geographers and economists have attempted to understand the contribution of tourism to global and regional economy [3][4][5][6][7] and to assess the impact of tourism on local people [8][9][10][11][12][13].
Research on tourism has traditionally relied on surveys and economic datasets, generally composed of small samples with a low spatio-temporal resolution. However, with the increasing availability of large databases generated by the use of geolocated information and communication technologies (ICT) devices such as mobile phones, credit or transport cards, the situation is now changing. Indeed, this flow of information has notably allowed researchers to study human mobility patterns at an unprecedented scale [14][15][16][17][18][19]. In addition, once these data are recorded, they can be aggregated in order to analyze the city's spatial structure and function [20][21][22][23][24][25][26][27], and they have also been successfully tested against more traditional data sources [28][29][30]. In the field of tourism geography, these new data sources have offered the possibility to study tourism behavior at a very high spatio-temporal resolution [18][31][32][33][34][35][36][37].
In this work, we propose a ranking of touristic sites worldwide based on their attractiveness measured with geolocated data as a proxy for human mobility. Many different rankings of the most visited touristic sites exist, but they are often based on the number of visitors, which does not tell us much about their attractiveness at a global scale. Here we apply an alternative method, proposed in [38] to measure the influence of cities. The purpose of this method is to analyze the influence and the attractiveness of a site based on the average radius traveled and the area covered by individuals visiting this site. More specifically, we select 20 of the most popular touristic sites of the world and analyze their attractiveness using a dataset containing about 10 million geolocated tweets, which have already proven to be a useful source of data to study mobility at a world scale [18,38]. In particular, we propose three rankings of the touristic sites' attractiveness based on the spatial distribution of the visitors' places of residence, and we show that the Taj Mahal, the Pisa Tower and the Eiffel Tower always appear in the top 5. Then, we study the touristic sites' visiting figures by country of residence, demonstrating that the Eiffel Tower, Times Square and the London Tower attract the majority of the visitors. To close the analysis, we focus on users detected in more than one site and explore the relationships between the 20 touristic sites by building a network of undirected trips between them.

MATERIALS AND METHODS
The purpose of this study is to measure the attractiveness of 20 touristic sites taking into account the spatial distribution of their visitors' places of residence. To do so, we analyze a database containing 9. [...] removed from the data by identifying users tweeting too quickly from the same place, with more than 9 tweets during the same minute, and from places separated in time and space by a distance larger than what can be covered by a commercial flight (with an average speed of 750 km/h). Their spatial distributions and that of the touristic sites can be seen in Figure 1.
In order to measure the site attractiveness, we need to identify the place of residence of every user who has been at least once in one of the touristic sites. First, we discretize the space by dividing the world into squares of equal area (100 × 100 km²) using a cylindrical equal-area projection. Then, we identify the place most frequented by a user as the cell in which he or she has spent most of his/her time. To ensure that this most frequented location is the user's actual place of residence, we impose the constraint that at least one third of the tweets have been posted from this location. The resulting dataset contains about 59,000 users' places of residence. The number of valid users is shown in Table 1 for each touristic site. In the same way, we identify the country of residence of every user who has posted a tweet from one of the touristic sites during the time period.
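The residence-detection step described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `cell_of` and `residence_cell` are hypothetical, the Lambert variant of the cylindrical equal-area projection (standard parallel at the equator) is assumed, and the number of tweets per cell is used as a stand-in for the time spent in a cell.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption, not given in the text)
CELL_SIZE_KM = 100.0      # side of a grid cell, as stated in the text


def cell_of(lat_deg, lon_deg):
    """Map a point to a 100 x 100 km cell of a Lambert cylindrical
    equal-area projection (x = R * lon, y = R * sin(lat))."""
    x = EARTH_RADIUS_KM * math.radians(lon_deg)
    y = EARTH_RADIUS_KM * math.sin(math.radians(lat_deg))
    return (math.floor(x / CELL_SIZE_KM), math.floor(y / CELL_SIZE_KM))


def residence_cell(points, min_share=1 / 3):
    """Return the most frequented cell of a user, or None when that cell
    holds less than `min_share` of the user's tweets.

    `points` is a list of (lat, lon) tuples, one per geolocated tweet.
    Tweet counts are used here as a proxy for time spent (assumption).
    """
    if not points:
        return None
    counts = Counter(cell_of(lat, lon) for lat, lon in points)
    cell, n = counts.most_common(1)[0]
    return cell if n / len(points) >= min_share else None
```

A user with three tweets from one location and one tweet elsewhere would be assigned to the first location's cell, whereas a user whose tweets are spread evenly over four distant cells would be discarded.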
Two metrics have been considered to measure the attractiveness of a touristic site based on the spatial distribution of the places of residence of users who have visited this site:
• Radius: The average distance between the places of residence and the touristic site. The distances are computed using the Haversine formula between the latitude and longitude coordinates of the centroids of the cells of residence and the centroid of the touristic site. In order [...]
• Coverage: The area covered by the users' places of residence computed as the number of distinct cells (or countries) of residence.
To fairly compare the different touristic sites, which may have different numbers of visitors, the two metrics are computed with 200 users' places of residence selected at random and averaged over 100 independent extractions. Note that, unlike the coverage, the radius does not depend on the sample size but, to be consistent, we decided to use the same sampling procedure for both indicators.
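As an illustration, the two metrics and the sampling procedure could be computed along the following lines. This is a minimal sketch under assumptions: the names `haversine_km` and `radius_and_coverage` are hypothetical, the Earth radius value is assumed, sampling is done without replacement, and no normalization of the distances is applied.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def radius_and_coverage(site, residences, sample_size=200, n_draws=100, seed=0):
    """Average radius (km) and coverage over repeated random samples.

    `site` is the (lat, lon) centroid of the touristic site.
    `residences` holds one (cell_id, lat, lon) tuple per visitor, where
    (lat, lon) is the centroid of the visitor's cell of residence.
    """
    rng = random.Random(seed)
    radii, coverages = [], []
    for _ in range(n_draws):
        sample = rng.sample(residences, sample_size)  # without replacement
        radii.append(
            sum(haversine_km(lat, lon, *site) for _, lat, lon in sample)
            / sample_size
        )
        coverages.append(len({cell for cell, _, _ in sample}))
    return sum(radii) / n_draws, sum(coverages) / n_draws
```

Replacing the cell identifier with a country code in `residences` would give the country coverage instead of the cell coverage.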

Touristic sites' attractiveness
We start by analyzing the spatial distribution of the users' places of residence to assess the attractiveness of the 20 touristic sites. In Figure 2a and Figure 2b, the touristic sites are ranked according to the radius of attraction, based on the distance traveled by the users from their cell of residence to the touristic site, and according to the area covered by the users' cells of residence. In both cases, the results are averaged over 100 random selections of 200 users. The robustness of the results has been assessed with different sample sizes (50, 100 and 150 users), and we obtained globally the same rankings for the two metrics. Both measures are strongly correlated, and for most of the sites the absolute difference between the two rankings is at most 2 positions. However, since the metrics are sensitive to slightly different information, both rankings also display some dissimilarities. For example, the Grand Canyon and the Niagara Falls exhibit a high coverage due to a large number of visitors from many distinct places in the US but a low radius of attraction at the global scale.
To complete the previous results, we also consider the number of countries of origin averaged over 100 random selections of 200 users. This gives us new insights on the origin of the visitors. For example, as can be observed in Figure 3, the visitors of the Grand Canyon mainly come from the US, whereas in the case of the Taj Mahal the visitors' countries of residence are more uniformly distributed. It is also interesting to note that in most cases nationals are the main source of visitors, except for Angkor Wat (Table 2). Some touristic sites have a national attractiveness, such as Mount Fuji and Zocalo, which host about 84% and 93% of locals respectively, whereas others have a more global attractiveness, as is the case for the Pisa Tower and Machu Picchu, which welcome only 21% of local visitors.
More generally, we plot in Figure 2c the ranking of touristic sites based on the country coverage. The results obtained are very different from the ones based on the cell coverage (Figure 2b). Indeed, some touristic sites can have a low cell coverage but with residence cells located in many different countries. This is the case of the Pyramids of Giza, which went up 7 places and now appears in second position. On the contrary, other touristic sites have a high cell coverage but with many cells in the same country, as in the previously mentioned cases of the Grand Canyon and the Niagara Falls. The ranks of the Taj Mahal, the Pisa Tower and the Eiffel Tower are consistent with the two previous rankings: these three sites are always in the top 5. Finally, we compare the rankings quantitatively with Kendall's τ correlation coefficient, which is a measure of association between two measured quantities based on the rank. In agreement with the qualitative observations, we obtain significant correlation coefficients between 0.66 and 0.77, confirming the consistency between rankings obtained with the different metrics.
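For reference, Kendall's τ between two rankings can be computed by counting concordant and discordant pairs. The sketch below implements the τ-a variant, assuming rankings without ties; the function name `kendall_tau` and the dictionary-based input format are illustrative choices, and the text does not state which variant of the coefficient was used.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items.

    `rank_a` and `rank_b` map each item to its rank (1 = first).
    Ties are assumed to be absent; tied pairs would count as neither
    concordant nor discordant.
    """
    items = list(rank_a)
    concordant = discordant = 0
    for i, j in combinations(items, 2):
        s = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs
```

Two identical rankings give τ = 1, fully reversed rankings give τ = -1, and swapping a single adjacent pair among four items gives τ = 2/3.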

Touristic site's visiting figures by country of residence
We can also do the opposite by studying the touristic preferences of the residents of each country. We extract the distribution of the number of visitors from each country to the touristic sites and normalize it by the total number of visitors in order to obtain the probability of visiting each touristic site according to the country of origin. This distribution can be averaged over the 70 countries with the highest number of residents in our database (gray bars in Figure 4). The Eiffel Tower, Times Square and the London Tower welcome on average 50% of the visitors of each country. It is important to note that these most visited touristic sites are not necessarily the ones with the highest attractiveness presented in the previous section. That is the advantage of the method proposed in [38], which allows us to measure the influence and the power of attraction of regions of the world with different numbers of local and non-local visitors.
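The normalization and averaging described above can be sketched as follows. The function names `visit_distributions` and `average_distribution` and the input format are assumptions made for illustration; the selection of the 70 countries with the most residents is left to the caller.

```python
from collections import Counter, defaultdict


def visit_distributions(visits):
    """Per-country probability of visiting each site.

    `visits` is an iterable of (country_of_residence, site) pairs,
    one per detected visit of a user to a site.
    """
    counts = defaultdict(Counter)
    for country, site in visits:
        counts[country][site] += 1
    return {
        country: {site: n / sum(c.values()) for site, n in c.items()}
        for country, c in counts.items()
    }


def average_distribution(distributions, sites):
    """Average the per-country distributions, giving each country equal weight."""
    n = len(distributions)
    return {
        site: sum(d.get(site, 0.0) for d in distributions.values()) / n
        for site in sites
    }
```

Because each country's distribution sums to one before averaging, a country with few visitors weighs as much as a country with many, which is why this average differs from a plain ranking by total number of visitors.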
We continue our analysis by performing a hierarchical cluster analysis to group together countries exhibiting similar distributions of the number of visitors across the touristic sites. Countries are clustered together using the ascending hierarchical clustering method, with average linkage as the agglomeration method and the Euclidean distance as the distance metric. To choose the number of clusters, we used the average silhouette index [39]. The results of the clustering analysis are shown in Figure 4. Two natural clusters emerge from the data. Unsurprisingly, these clusters are composed of countries whose residents tend to visit more often the touristic sites located in countries belonging to their own cluster. The first cluster gathers countries of America and Asia, whereas the second one is composed of countries from Europe and Oceania.
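To make the clustering procedure concrete, here is a small pure-Python sketch of average-linkage agglomerative clustering with the Euclidean distance, together with the average silhouette index used to pick the number of clusters. It is a naive illustration for small inputs, not the authors' code; all function names are hypothetical, at least two clusters are assumed when computing the silhouette, and singleton clusters are given a silhouette of 0 by convention.

```python
import math
from itertools import combinations


def euclid(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_distance(points, ca, cb):
    """Average pairwise distance between two clusters of point indices."""
    total = sum(euclid(points[i], points[j]) for i in ca for j in cb)
    return total / (len(ca) * len(cb))


def average_linkage(points, k):
    """Merge the two closest clusters (average linkage) until k remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: mean_distance(points, clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters


def average_silhouette(points, clusters):
    """Mean silhouette over all points; requires at least two clusters."""
    scores = []
    for c in clusters:
        for i in c:
            if len(c) == 1:
                scores.append(0.0)  # convention for singleton clusters
                continue
            a = sum(euclid(points[i], points[j]) for j in c if j != i) / (len(c) - 1)
            b = min(mean_distance(points, [i], o) for o in clusters if o is not c)
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def best_partition(points, k_values):
    """Partition with the highest average silhouette among the tested k."""
    return max(
        (average_linkage(points, k) for k in k_values),
        key=lambda clusters: average_silhouette(points, clusters),
    )
```

In this setting each point would be the vector of visiting probabilities of one country over the 20 sites, and `best_partition(points, range(2, 10))` would return the partition with the best silhouette among 2 to 9 clusters.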

Network of touristic sites
In the final part of this work, we investigate the relationships between touristic sites based on the number of Twitter users who visited more than one site during a time window between September 2010 and October 2015. More specifically, we built an undirected spatial network in which every link between two touristic sites represents at least one user who has visited both sites. As in any co-occurrence network, the weight of a link between two sites is equal to the total number of users who visited both connected sites. The network is represented in Figure 5, where the width and the brightness of a link are proportional to its weight and the size of a node is proportional to its weighted degree (strength). The Eiffel Tower, Times Square, Zocalo and the London Tower appear to be the most central sites, playing a key role in the global connectivity of the network (Table 3). The Eiffel Tower alone accounts for 25% of the total weighted degree. The three links exhibiting the highest weights connect the Eiffel Tower with Times Square, the London Tower and the Pisa Tower, representing 30% of the total sum of weights. Zocalo is also well connected with the Eiffel Tower and Times Square, representing 11% of the total sum of weights.
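The construction of such a co-occurrence network can be sketched as follows. The function name `cooccurrence_network` and the input format (a mapping from user identifier to the set of sites visited) are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(user_sites):
    """Build an undirected weighted co-occurrence network of sites.

    `user_sites` maps each user id to the collection of sites visited.
    Returns (weights, strength): `weights[(a, b)]` is the number of users
    who visited both a and b (with a < b), and `strength[a]` is the
    weighted degree of site a.
    """
    weights = Counter()
    for sites in user_sites.values():
        # Each unordered pair of distinct sites counts once per user.
        for a, b in combinations(sorted(set(sites)), 2):
            weights[(a, b)] += 1
    strength = Counter()
    for (a, b), w in weights.items():
        strength[a] += w
        strength[b] += w
    return weights, strength
```

The share of a site in the total weighted degree is then `strength[site] / sum(strength.values())`, and the share of a link in the total sum of weights is `weights[link] / sum(weights.values())`.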

DISCUSSION
We study the global attractiveness of 20 touristic sites worldwide taking into account the spatial distribution of the places of residence of the visitors as detected from Twitter. Instead of studying the most visited places, the analysis focuses on the sites attracting visitors from the most diverse parts of the world. A first ranking of the sites is obtained based on the cells of residence of the users at a geographical scale of 100 by 100 kilometers. Both the radius of attraction and the coverage of the visitors' origins consistently point toward the Taj Mahal, the Eiffel Tower and the Pisa Tower as top rankers. When the users' place of residence is scaled up to country level, these sites still appear at the top, and we are also able to discover particular cases such as the Grand Canyon and the Niagara Falls, which are mostly visited by users residing in their host countries. At country level, the top rankers are the Taj Mahal and the Pyramids of Giza, the latter exhibiting a low cell coverage but with residence cells distributed in many different countries.
Our method of using social media as a proxy to measure human mobility lays the foundation for more involved analyses. For example, when we cluster the countries by the touristic sites their residents visit, two main clusters emerge: one including the Americas and the Far East and the other with Europe, Oceania and South Africa. The relations between sites have also been investigated by considering users who visited more than one place. An undirected network was built connecting sites visited by the same users. The Eiffel Tower, Times Square, Zocalo and the London Tower are the most central sites of the network.
In summary, this manuscript serves to illustrate the power of geolocated data to provide worldwide information regarding leisure-related mobility. The data and the method are completely general and can be applied to a large range of geographical locations, travel purposes and scales. We thus hope that this work contributes toward a more agile and cost-efficient characterization of human mobility.

Figure 5. Network of undirected trips between touristic sites. The width and the brightness of a link are proportional to its weight. The size of a node is proportional to its weighted degree.