- Regular article
- Open Access
Nowcasting earthquake damages with Twitter
© The Author(s) 2019
- Received: 22 August 2018
- Accepted: 24 January 2019
- Published: 31 January 2019
The Modified Mercalli intensity scale (Mercalli scale for short) is a qualitative measure used to express the perceived intensity of an earthquake in terms of damages. Accurate intensity reports are vital to estimate the type of emergency response required for a particular earthquake. In addition, Mercalli scale reports are needed to estimate the possible consequences of strong earthquakes in the future, based on the effects of previous events. Emergency offices and seismological agencies worldwide are in charge of producing Mercalli scale reports for each affected location after an earthquake. However, this task relies heavily on human observers in the affected locations, who are not always available or accurate. Consequently, Mercalli scale reports may take hours or even days to be published after an earthquake. We address this problem by proposing a method for early prediction of spatial Mercalli scale reports based on people’s reactions to earthquakes in social networks. By tracking users’ comments about real-time earthquakes, we create a collection of Mercalli scale point estimates at the municipality level (i.e., state subdivisions). We introduce the concept of reinforced Mercalli support, which combines Mercalli scale point estimates with locally supported data (named ‘local support’). We use this concept to produce Mercalli scale estimates for real-world events, smoothing point estimates with a spatial smoother that incorporates the distribution of municipalities in each affected region. Ours is the first social media-based method that can provide spatial damage reports on the Mercalli intensity scale. Experimental results show that our method is accurate and provides early spatial Mercalli reports 30 minutes after an earthquake. Furthermore, we show that our method performs well in earthquake spatial detection and maximum intensity prediction tasks.
Our findings indicate that social media is a valuable source of spatial information for quickly estimating earthquake damages.
- Event damage assessment
- Mercalli intensities
Mercalli reports provide crucial information for timely emergency response and planning. Therefore, government agencies related to emergency management and geological centers strive to provide intensity reports in a timely and accurate manner to help mitigate disaster effects.
Mercalli reports are prepared by observers who have been appointed to different geographical areas. However, not all locations worldwide have appointed observers. Due to the human effort involved in producing intensity reports, these are commonly released hours or even days after a seismic movement. Many factors can obstruct the production of fast reports, among them the quality of communications during a disaster or observer availability. To improve intensity reporting, agencies such as the United States Geological Survey (USGS) have even created crowd-sourcing tools to collect this data online from regular people.1 On the other hand, social media users are regarded as providers of timely information, which allows the characterization of physical-world events. Current advances show promising results in the direction of event analysis. Some of the most noteworthy studies address automatic event description and summarization using information provided by users as a situational information source. However, social media data poses important challenges for information extraction, requiring researchers to design sophisticated methods for extracting useful knowledge from noisy data. In the specific scenario of earthquake disaster management, the state of the art shows efforts in several tasks, such as earthquake detection and damage area detection [6–8], as well as maximum Mercalli intensity estimation. These works focus on detecting an earthquake in a heavily populated city where it was perceived, and on inferring the corresponding Mercalli intensity. Since heavily populated cities have many social media users, such events may produce a trending topic. Related work shows that it is possible to infer the maximum intensity on the Mercalli scale using trending topics. However, these methods have only been studied for the detection of high-energy events. Medium-scale events produce noisy data and often remain unexplored due to technical limitations.
We address the problem of early Mercalli scale estimation by studying how social media can contribute to this task. In particular, we propose a new approach, which focuses on the spatial estimation of Mercalli intensities and makes use of the volume and freshness of social data. We use municipalities as units of spatial aggregation, mapping posts to cities according to users’ locations.
Social media data can be very noisy. We deal with noise by introducing the concept of “Reinforced Mercalli Support,” a key building block of our method. The idea behind “Reinforced Mercalli Support” is to process user posts that have high support at the municipality level. Locally supported posts help us detect local trends, validating data that might otherwise be ignored at more aggregated levels of analysis (e.g., at the state or country level). The spatial dimension of the problem, namely how people are distributed across a territory and how this information affects the intensity inference process, is another key component of our proposal. We use spatial smoothing to deal with this aspect of the problem. Our method is inspired by the procedure used to elaborate Mercalli reports. These reports, which are based on data provided by on-the-scene experts, make use of the spatial distribution of the observers.
Our findings show that our approach can automatically provide spatial Mercalli reports in a timely and accurate manner. In particular, in this article we extend the following contribution introduced in our prior work:
“We successfully deal with data at municipality level, detecting local trends in the specific task of maximum Mercalli intensity detection”.
- We present the first approach that addresses the problem of spatial Mercalli intensity inference based entirely on social media data.
- We introduce a new concept, the reinforced Mercalli support estimate, which successfully combines local trends and local support in a single variable.
- We show empirically, using real-world data, that our method provides accurate and fast Mercalli reports at a fine level of spatial granularity.
- Our method can produce a spatial Mercalli report thirty minutes after an earthquake, helping to improve current intensity estimation times and providing information complementary to that given by human observers.
The paper is organized as follows. Section 2 presents a review of the relevant literature. In Sect. 3 we introduce our proposed method. In Sect. 4 we present our experimental validation, and we conclude in Sect. 5 with a discussion, conclusions, and an outline of future work.
Twitter is an online social network with more than 300 million active users per month.2 This platform has become a huge source of real-time user-generated content. A sample of Twitter’s streaming content is publicly available through the platform’s API. These features have sparked considerable scientific interest during the past decade. Research includes, among others, studies to understand collective behavioral patterns, and to find correlations between social media and physical-world events.
During disaster situations and emergencies, many changes in the behavior of social media users have been observed, producing mass convergence phenomena. Social media has allowed researchers to gain insight into the dynamics of information propagation during crisis situations [5, 13, 14], displaying collaborative patterns useful for information filtering, and for the assessment of information credibility. Social data has also been used to analyze, for example: forest fires, power outages in electrical systems, large-scale protests, and bus accidents in public transportation. All of these systems focus their efforts on providing situational awareness (local and timely information) of an event. Specifically, research has proven that social media can be valuable for rapidly assessing damage during large-scale disasters. Vieweg et al. show how Twitter contributes to enhancing situational awareness during two natural hazards: the Oklahoma Grassfires and the Red River Floods in the U.S. Kryvasheyeu et al., on the other hand, studied Hurricane Sandy and discovered a strong relationship between hurricane-related Twitter activity and the actual path of the hurricane. Furthermore, they showed that for major disasters there is a correlation between damage and social media activity. Other efforts have focused on communication infrastructure during earthquakes by providing methods to favor message sharing during disasters in mobile networks, providing access to spatio-temporal data during disasters, and testing communication infrastructure using simulated data. Along this line, several text mining methods have been used to elaborate reports, providing event summarization (a short textual description of the earthquake) or detecting related local events such as looting and pillaging.
Earthquake detection and analysis using social media is a particularly active field of study. The first efforts in this subject date from 2010, when the correlation between the number of tweets and the intensity of an earthquake was observed for the first time.3 In 2011, during the earthquake in Tohoku, researchers noted the existence of a high correlation between the number of user posts on Twitter (known as tweets) and the earthquake’s intensity in certain locations [27, 28]. The relationship between tweet rates and Mercalli intensity was later revisited by Kropivnitskaya et al. They showed that tweets and Mercalli intensity correlated during three different earthquakes located in California, Japan, and Chile during 2014. In a related study, Crooks et al. analyzed the spatio-temporal characteristics of the relationship between an earthquake’s seismic wave and social media posts. They show that Twitter data is comparable to that gathered by specialized crowdsourcing initiatives,4 and arrives more rapidly.
In addition, several earthquake alert systems have been created for different countries, such as Australia, Japan, and Italy, as well as more general worldwide monitoring systems [8, 12]. Most of these systems use some type of burst detection algorithm over the tweet stream to report an earthquake, where a burst is defined as a large number of tweets occurring within a short time window. These systems are focused on the specific task of earthquake detection, namely under which conditions we may confirm or deny that an earthquake reported in social media really occurred. Although the primary goal of these systems is to report that an earthquake happened in a given location, they have shown that it is possible to infer more information from social media data. Some of the most salient results on seismic event reports rely on the estimation of the epicenter of an earthquake using only information recovered from Twitter [6, 33]. Also, TwiFelt, an online system, uses the Twitter stream to estimate the area in which an earthquake was felt in Italy. The system uses only geolocated tweets, with good performance for high-intensity earthquakes. However, reliability in this case depends on the existence of geolocated tweets, which can be scarce in many countries (between 4% and 7%).
Regarding the problem of earthquake intensity estimation, Burks et al. showed that Twitter can provide useful data to estimate shaking intensity. They proposed an approach that combines earthquake characteristics measured using seismographs (such as moment magnitude, source-to-site distance, and wave velocity) with Twitter data (extracted from tweets that contained the term ‘earthquake’). Conditioned by a set of reports retrieved from seismographs, they segmented the area around each recording station into nine radial subareas. They mapped to each of these areas, according to GPS location, all of the earthquake-related tweets produced during the 10 minutes following an earthquake. They then computed lexical features for each subarea to study the correlation of these features with the Mercalli intensity. The authors showed good prediction of earthquake shaking intensity when combining earthquake measurements with tweets. In our work we use some of the same tweet features used by Burks et al. However, our method differs from theirs in that our goal is to provide rapid Mercalli estimates using only social media data, without seismograph recordings. We aim towards understanding the full contribution of social media to intensity estimation, as well as avoiding dependence on dense seismographic networks.
Possibly the closest work to our proposal is that of Cresci et al. In their approach, the authors studied how to estimate the maximum intensity on the modified Mercalli scale using only Twitter features. Using linear regression models over a collection of aggregated features, testing 45 different attributes, they showed that Twitter has enough predictive power to infer the maximum intensity of an earthquake. The features tested were extracted from user profiles, from tweet content, and from time-based features of the Twitter stream (e.g., tweet interval rates). Our proposal extends the work of Cresci et al., but focuses on the value of message content and on producing accurate spatially distributed estimations (not only maximum intensity prediction). Another difference is that our method uses only 12 lexical features, reducing dimensionality. In addition, we produce a spatial report of the event by enriching the process with spatial information such as the geographical distribution of users. Nevertheless, in Sect. 4 we compare our method with that of Cresci et al. in the specific task of maximum intensity prediction.
In terms of spatially distributed data, we have not found prior work that generates spatial reports for earthquake intensity prediction. However, there is prior work that deals with producing spatial metrics, similar to ours, for other types of natural hazards [22, 30, 35]. These works face methodological issues similar to those of our approach, related to accurate geolocation of messages, data density, and distance to the event location. Regarding the use of geolocation, authors such as Yin et al. have shown that location accuracy can be improved by inferring locations from text at different geographical levels. For a more complete overview of the use of social media for mass emergencies and the challenges that this involves, we refer the reader to the survey by Imran et al.
3.1 Earthquake social effect characterization
Our approach uses Twitter as a data source of user-generated information about earthquakes. To obtain this data, every time an earthquake hits, we retrieve messages from the social platform that match any of the following keywords: sismo, temblor, temblando, and terremoto, the Spanish terms loosely corresponding to seismic, quake, shaking, and earthquake. We collect these messages from the time of the earthquake up until 30 minutes afterwards. Given that our goal is to produce early Mercalli estimates, we do not use any data later than 30 minutes after the event, because at that point the first partial Mercalli reports (produced by specialized agencies) start to appear.
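As a sketch of this collection step, the keyword filter and the 30-minute window can be expressed as follows. The tweet dictionary layout (`time`, `text` fields) is a simplifying assumption; the actual pipeline reads from the Twitter API.

```python
from datetime import datetime, timedelta

# Spanish tracking keywords used by the method.
KEYWORDS = ("sismo", "temblor", "temblando", "terremoto")

def collect_window(tweets, quake_time, minutes=30):
    """Keep keyword-matching tweets posted within `minutes` of the event."""
    end = quake_time + timedelta(minutes=minutes)
    return [
        t for t in tweets
        if quake_time <= t["time"] <= end
        and any(k in t["text"].lower() for k in KEYWORDS)
    ]
```

Tweets outside the window, or that mention none of the keywords, are discarded before any further processing.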
Table 1. Municipality-level features per event. The first eleven features are calculated over the set of tweets that correspond to a specific municipality; the last feature corresponds to the municipality population.
- NUMBER OF TWEETS: number of tweets produced in the municipality
- TWEETS NORM: number of tweets produced in the municipality divided by the number of users in the municipality
- AVERAGE WORDS: average tweet length (in number of words)
- AVERAGE LENGTH: average tweet length (in number of characters)
- HASHTAG SYMBOLS: fraction of tweets containing the # (hashtag) symbol
- MENTION SYMBOLS: fraction of tweets containing the @ (mention) symbol
- POPULATION: the municipality population
In order to perform the municipality-level aggregation of data required by our approach, we must examine each tweet for geolocation information. When available, the geolocation allows us to map a message back to the geographical area where it originated. Hence, we only use those messages for which we are able to extract a valid geolocation. To geolocate tweets we use the following steps: (1) if available, we extract the exact GPS coordinates from the tweet’s location field; (2) if the location field was not provided by the user, we process the tweet’s textual content. That is, we analyze the message’s text (e.g., “Earthquake in Valparaiso!!!”) to label possible location mentions using Named Entity Recognition (NER); then, for each labeled location, we use a fuzzy string matching procedure5 to map the location to its corresponding municipality. (3) Finally, if all else fails, we apply the same procedure as in (2), but this time to the text provided by the user in their profile information. We acknowledge that this procedure can be noisy, since not all locations will be accurately mapped. However, we believe that spatial patterns will still emerge. In this sense, more accurate methods for tweet geolocation could improve this aspect of our approach. Nevertheless, this is an open problem that, for the time being, we consider beyond the scope of our work.
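The three-step cascade can be sketched as follows. This is a minimal illustration in which the NER step is replaced by direct fuzzy matching of message tokens against a toy gazetteer of four municipalities; the actual pipeline uses a proper NER tagger and the full municipality list.

```python
import difflib

# Toy gazetteer (illustrative; the real list covers 345 municipalities).
MUNICIPALITIES = ["Valparaiso", "Santiago", "Limache", "Concepcion"]

def match_municipality(text, cutoff=0.8):
    """Fuzzy-match tokens of a location string against the gazetteer."""
    for token in text.replace("!", " ").replace(",", " ").split():
        hit = difflib.get_close_matches(token.title(), MUNICIPALITIES,
                                        n=1, cutoff=cutoff)
        if hit:
            return hit[0]
    return None

def geolocate(tweet):
    """Three-step cascade: GPS field -> tweet text -> profile location."""
    # (1) Exact coordinates, if the user attached them.
    if tweet.get("coordinates"):
        return ("gps", tweet["coordinates"])
    # (2) Location mentions in the message text.
    hit = match_municipality(tweet.get("text", ""))
    if hit:
        return ("text", hit)
    # (3) Free-text location declared in the user profile.
    hit = match_municipality(tweet.get("profile_location", ""))
    if hit:
        return ("profile", hit)
    return (None, None)
```

The cutoff of 0.8 is an illustrative choice: it tolerates small misspellings while rejecting unrelated words.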
Once all of the remaining tweets have been aggregated at municipality level, each municipality is processed to extract 12 features, detailed in Table 1. These features provide a high-level characterization of user activity related to the earthquake, for each geographical subdivision.
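Assuming each tweet is a dictionary with `user` and `text` fields, the per-municipality feature extraction can be sketched as below. Only the features whose names appear later in the paper are computed, and the RT-token definition is our assumption.

```python
def municipality_features(tweets, population):
    """Compute a subset of the Table 1 features for one municipality.
    Feature names follow those used in the experimental section."""
    n = len(tweets)
    users = {t["user"] for t in tweets}
    frac = lambda pred: sum(1 for t in tweets if pred(t)) / n if n else 0.0
    return {
        "NUMBER OF TWEETS": n,
        "TWEETS NORM": n / len(users) if users else 0.0,
        "AVERAGE WORDS": sum(len(t["text"].split()) for t in tweets) / n if n else 0.0,
        "AVERAGE LENGTH": sum(len(t["text"]) for t in tweets) / n if n else 0.0,
        "HASHTAG SYMBOLS": frac(lambda t: "#" in t["text"]),
        "MENTION SYMBOLS": frac(lambda t: "@" in t["text"]),
        # Assumed definition: tweets carrying the retweet token.
        "RT SYMBOLS": frac(lambda t: "RT" in t["text"].split()),
        "POPULATION": population,
    }
```

The resulting dictionary is the feature vector for one (earthquake, municipality) pair.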
3.2 Region of interest estimation
The next stage of our approach is estimating which municipalities were affected by the earthquake. We refer to these municipalities as the region of interest of an earthquake. Only those municipalities deemed as being affected by the earthquake will be used for spatial Mercalli intensity estimation in the following stage. To estimate the geographical subdivisions that were affected by the seismic event, we use a supervised classification model. This model separates municipalities into two classes: unaffected by the earthquake and affected by the earthquake.
To create this model we used a 0/1 classification algorithm, which we trained using municipality-level data modeled as feature vectors (using the features shown in Table 1). The label used for each municipality was class “0” if the earthquake was not perceived by the population (i.e., the municipality had no official Mercalli intensity value associated with it), and class “1” if the earthquake was perceived by the population (i.e., the municipality had an official Mercalli value associated with it). The Mercalli intensity values that we used to label the municipality-level data correspond to values in official earthquake reports. More details on the technical and empirical aspects of the model creation are presented in Sect. 4.
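The experiments train this classifier in Weka; as a stand-in, a minimal 0/1 classifier (logistic regression fitted by stochastic gradient descent) over the municipality feature vectors could look like this:

```python
import math

def train_logistic(X, y, lr=0.1, epochs=200):
    """Fit a logistic-regression 0/1 classifier by SGD.
    X: list of feature vectors (e.g., the 12 Table 1 features),
    y: labels, 0 (earthquake not perceived) / 1 (perceived)."""
    w = [0.0] * (len(X[0]) + 1)               # weights + bias term
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[-1] + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))    # predicted P(class = 1)
            g = p - yi                        # log-loss gradient factor
            for j, xj in enumerate(xi):
                w[j] -= lr * g * xj
            w[-1] -= lr * g
    return w

def predict(w, x):
    """1 if the municipality is predicted inside the region of interest."""
    z = w[-1] + sum(wj * xj for wj, xj in zip(w, x))
    return 1 if z >= 0 else 0
```

Any off-the-shelf binary classifier would serve the same role; the point is only the 0/1 decision per municipality.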
3.3 Spatial Mercalli estimation
We next create spatial Mercalli estimates for the municipalities that are part of the region of interest of an earthquake. This process is divided into 3 steps: (i) reinforced Mercalli support estimation, (ii) adjusted Mercalli estimation, and (iii) spatial distribution of Mercalli intensities. We proceed to describe each of these steps.
3.3.1 Reinforced Mercalli support estimation
As a first step to estimate spatial Mercalli values, we define a municipality-level variable, which we call reinforced Mercalli support. The goal of this variable is to give more weight to intensity estimations that come from regions that displayed a larger amount of social activity. The rationale is to limit the effect of noisy reports by including only information with high local support.
Let i be the index that denotes a municipality belonging to the region of interest of a given earthquake. We then define the local support \(s(i) \in[0,1]\) of the ith municipality as the ratio between the users in i that reported the earthquake and the total number of different users in i who have reported earthquakes in the entire (training) dataset. Next, we define the Mercalli point estimate \(m(i)\) for the ith municipality as an intermediate estimate for the Mercalli value of i, which is obtained using a regression model. To estimate \(m(i)\) we use a regression algorithm trained with earthquake Mercalli intensities and their corresponding municipality-level features. The Mercalli intensities used for each earthquake-municipality pair are based on official reports by governmental agencies. More details on the technical and empirical aspects of the regression model creation are discussed in Sect. 4.
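The local support can be written directly from its definition; the clamp to 1 is our addition, covering the corner case of users who report this earthquake but did not appear in the training data.

```python
def local_support(reporting_users, historical_reporters):
    """s(i) in [0, 1]: distinct users in municipality i posting about this
    earthquake, divided by all distinct users in i that reported
    earthquakes in the entire training dataset."""
    hist = set(historical_reporters)
    if not hist:
        return 0.0
    return min(1.0, len(set(reporting_users)) / len(hist))
```

The point estimate \(m(i)\) comes from the separately trained regression model and is not sketched here.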
3.3.2 Adjusted Mercalli estimation
3.3.3 Spatial distribution of Mercalli intensities
Municipalities are defined as areal partitions of a geographical region. Hence, to predict municipality-level Mercalli intensities, it makes sense to consider the effect of spatial correlations with other nearby municipalities that are part of the region of interest. To do this, we smooth the adjusted Mercalli estimate of each municipality in relation to the adjusted Mercalli estimates of its nearest neighbors. The influence of a neighbor on a given municipality is inversely determined by its geodesic distance to the municipality, measured as the pairwise distance between the largest cities of the two municipalities. Since the adjusted Mercalli estimate (Eq. 2) of a municipality is conditioned on the local support that it had for the event, considering the largest city in each municipality gives us a high level of confidence in the distance estimation.
The idea behind this spatial smoothing is to provide a robust Mercalli estimation for municipalities that did not have sufficient support to provide a fair point estimate. This problem affects rural areas where Internet access is limited and/or marginal in proportion to the municipality’s population. In these locations, spatial smoothing helps to infer a Mercalli intensity estimate even when only a low number of Twitter reports is available.
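A sketch of this smoothing step is given below. The λ-mix between the local estimate and the inverse-distance-weighted neighbor average is one plausible reading of the smoothing described in the text (the exact combination rule is not reproduced in this excerpt); λ here plays the role of the smoothing parameter evaluated in Sect. 4. Geodesic distances between largest cities are computed with the haversine formula, assuming distinct coordinates per municipality.

```python
import math

def geodesic_km(p, q):
    """Haversine distance between two (lat, lon) points, in km."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def smooth(estimates, cities, lam=0.8, k=3):
    """Inverse-distance smoothing of adjusted Mercalli estimates.
    estimates: municipality -> adjusted estimate;
    cities: municipality -> (lat, lon) of its largest city."""
    out = {}
    for i, ai in estimates.items():
        # k nearest neighbors by geodesic distance, weighted by 1/d.
        neigh = sorted((geodesic_km(cities[i], cities[j]), j)
                       for j in estimates if j != i)[:k]
        wsum = sum(1.0 / d for d, _ in neigh)
        nb_avg = (sum(estimates[j] / d for d, j in neigh) / wsum) if wsum else ai
        out[i] = (1 - lam) * ai + lam * nb_avg
    return out
```

With λ = 0 the smoothing is disabled and each municipality keeps its own adjusted estimate; as λ grows, isolated low-support estimates are pulled towards their neighborhood.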
In this section we present the experimental validation of our proposed method. In particular, we evaluate the performance of our approach for estimating Mercalli values at the municipality level based on social media activity during earthquakes. First we present a description and characterization of our datasets, and then we present our results.
4.1 Dataset description and characterization
We use two datasets: a ground truth dataset obtained from a seismological agency, and a Twitter dataset from which our method computes its Mercalli estimates. We describe them next.
4.1.1 Ground truth earthquake dataset
As a ground truth dataset we used an earthquake catalog provided publicly by the National Seismological Center of Chile, also known internationally as GUC. This catalog contains information about earthquakes registered in Chile from January 2016 to June 2017. This information is provided at the municipality level and includes the event magnitude, reported on the moment magnitude scale, and, if the earthquake was perceived by the population, its Mercalli intensity report. The catalog contains 332 earthquakes perceived by the population, which ranged from 2.2 Ml to 7.6 Mw in magnitude. Each entry in the catalog corresponds to an earthquake-municipality pair with its corresponding intensity value on the Mercalli scale. In total, the catalog comprises 8296 entries. We use a local-scope catalog because it contains fine-grained data about earthquakes in the Chilean territory for all magnitude ranges, which is not otherwise available in global-scope catalogs.
4.1.2 Twitter dataset
Our second dataset corresponds to data obtained from the public Twitter stream, using the search API.6 In order to retrieve conversations related to earthquakes, we collected tweets that matched any of the Spanish keywords sismo, temblor, temblando, and terremoto. Overall, we collected 825,310 tweets, which were posted by 309,749 different users during the time period of our study (i.e., from January 2016 to June 2017). From these tweets, we wanted to keep only those that corresponded to earthquake mentions generated in Chile, so that we could use them with our local-scope ground truth data. However, only 2200 of these tweets had GPS locations (0.26%). Therefore, we extracted additional location information from users’ profiles using the heuristic approach described in detail in Sect. 3.1. Using this approach we found that 207,015 users (i.e., 66.8%) registered a valid location in their profile, of which 57,546 indicated being located in Chile. For these users, we then used approximate matching to associate their profile information with a list of Chilean municipalities. This resulted in a total match of 41,885 users to Chilean municipalities, which in turn yielded a total of 187,317 tweets mapped to 345 different municipalities in Chile.
Table 2. Our dataset in terms of municipality-level local instances and the coverage of Twitter over the GUC catalog
Table 2 shows the number of data units at the municipality level for the GUC catalog (“GUC”), the intersection of both datasets (“Twitter + GUC”), the number of GUC events that did not have coverage on Twitter (“Not Covered”), and the percentage of units covered (“Coverage %”). According to Table 2, our dataset has large coverage of the events registered by the GUC, with almost perfect coverage for medium- to high-energy events. Low-energy events are less reported on Twitter because many times they are not perceived by the population. Overall, the intersection between the Twitter and GUC data produces 6790 municipality-level data instances, with an average coverage of 81.8%.
4.1.3 Data characterization
The first 2-row block in Table 3 shows the correlation between the first variable, the target variable MERCALLI, and each of the twelve features used by our method. The second 2-row block shows correlation between the second variable, NUMBER OF TWEETS, and each of the remaining features (except MERCALLI), and so on for the rest of the table.
There is a positive correlation between NUMBER OF TWEETS and TWEETS NORM. There is a negative correlation between TWEETS NORM and POPULATION. An expected correlation arises between AVERAGE WORDS and AVERAGE LENGTH, and also between MENTION SYMBOLS and RT SYMBOLS, since the latter is a subset of the former. This is because messages that are re-posted on Twitter (i.e., retweeted) always include a mention of the author of the original message. As expected, the number of special symbols increases with tweet length.
A second aspect that we analyzed was the variance. Boxplots in Fig. 5 show low variance with respect to MERCALLI in several cases (see boxplots for TWEETS NORM, AVERAGE LENGTH, and AVERAGE WORDS). However, there are features that show high variance in relation to MERCALLI, such as HASHTAG SYMBOLS, RT SYMBOLS, and MENTION SYMBOLS.
4.2 Experiment and results
Training/testing partition for events according to the maximum Mercalli intensity of each seismic movement. High-energy events are less frequent than low-energy events. Note that this table summarizes our dataset in terms of the number of earthquakes, but for each of these earthquakes we have many municipality-level local instances. In fact, as high-energy earthquakes cover a wider area, they produce many municipality-level instances
4.2.1 Region of interest estimation
Training accuracy per class for the region of interest estimation, using 5-fold cross validation
We applied the resulting model to the test partition, obtaining 1867 correctly classified instances out of a total of 2847, i.e., an accuracy of 65.57%. This shows that the classifier generalizes well, since the overall accuracies of the training and testing partitions are similar. The low precision for this task illustrates that the problem is difficult to solve, probably due to the presence of noise at fine levels of aggregation. However, what remains important is that the testing recall is high, indicating good predictability for class 1.
Testing accuracy per class for the region of interest detection task. Low precision indicates that the problem is hard to solve using classification at fine level granularity. However, a simple 0/1 classifier is sufficient to infer the region of interest with 0.816 recall
Testing performance according to the actual Mercalli intensity for the region of interest detection task. A 0/1 classifier is sufficient for medium and high energy events at municipality level, showing good performance in terms of recall
4.2.2 Using regression and spatial smoothing to estimate Mercalli intensities
As detailed in Sect. 3.3, we performed the experimental validation of the regression and spatial smoothing procedures used to estimate Mercalli intensities. We used a support vector regression model with the sequential minimal optimization (SMO) algorithm implemented in Weka 3.7. We trained with five-fold cross-validation, using as training instances municipalities where an earthquake was perceived, with Mercalli values ranging from 1 to 7. To deal with intensity imbalance, we applied instance re-sampling biased towards class uniformity, obtaining a total of 5470 training instances. To calculate the support vectors, we used a normalized polynomial kernel with an exponent equal to 2. During the training process, the fitted model achieved a correlation coefficient of 0.65 with a mean absolute error (MAE) of 1.15. The same configuration was used to fit an SMO regression model over a reduced set of features with high correlation with the Mercalli intensity (NUM TWEETS, NUM TWEETS NORM, and POPULATION), achieving a correlation coefficient of only 0.304. Therefore, after corroborating that the best correlation coefficients in the regression were achieved using all features, we discarded the model based on the reduced feature set. We selected the model based on all 12 features as a baseline predictor of Mercalli intensity at the municipality level (denoted by \(m(i)\) in Equation (2)).
After re-evaluation on the test set, the correlation coefficient decreased to 0.26 with a MAE of 2.26. This result indicates that the sole use of a regression is insufficient to perform accurate predictions. We show next that the use of our adjusted Mercalli estimation and the inclusion of spatial smoothing boosts the method’s accuracy, outperforming the baseline.
Overall MAE at different values of λ
Table 8 shows the value of using spatial smoothing in our method. On the one hand, when spatial smoothing is disabled (\(\lambda=0\)), the method achieves its worst result, with an overall MAE of 2.078, which is almost the same value achieved by the baseline. On the other hand, when the prevalence of the spatial component increases, the error decreases. The best value is achieved at \(\lambda=0.8\).
To better understand the quality of our results, note that the results displayed in Fig. 6 correspond to the errors of a complete spatial Mercalli report, which is a result significantly distinct from those observed in state-of-the-art methods, where the Mercalli estimation is focused on the aggregated estimation of the maximum intensity per earthquake. We remark at this point that our method is the first to provide a complete spatial Mercalli report.
Wilcoxon rank sum test results for differences in absolute errors between our method and the baseline. The second column indicates the Wilcoxon statistic per each level of the test
4.2.3 Maximum intensity prediction task using social media
We compare our method with the state-of-the-art technique for the specific task of maximum intensity prediction using social media. According to our literature review, the method proposed by Cresci et al.  is the only one that can be directly compared to ours, in the sense that it relies solely on Twitter data; as such, it can be evaluated on the same dataset as our method. As noted in our related work overview in Sect. 2, the work by Burks et al.  also uses Twitter data to estimate intensity. However, their approach is not directly comparable to ours because their model uses Twitter in combination with actual seismograph measurements. Since the goal of our work is to provide spatial Mercalli intensity reports based only on Twitter data, we consider comparisons with approaches that use seismograph measurements beyond the scope of the current work. Regardless, we believe it is important to understand the relationship between seismograph recordings and Twitter data, and how the latter can complement the former. We discuss this further in Sect. 5.
Testing averaged MAE of the maximum Mercalli intensity for each earthquake
Cresci et al. 
As Table 10 reveals, our method performs well in the specific task of maximum intensity prediction, being competitive with the state of the art. The method of Cresci et al.  outperforms ours at intensity 4, but at the cost of much less accurate predictions for low- and high-energy seismic movements. Note that at intensity 7, the method of Cresci et al. cannot detect the earthquake at all, while our method's error is only 1 MAE point. This noteworthy result arises because the earthquake was located near Limache, a rural and sparsely populated locality in Chile. The event produced only a local trend in the region of interest and was practically uncommented on in the capital of Chile during the first half hour after the event; this local trend was successfully detected by our method and discarded by our competitor.
The improvement of our method over the baseline is substantial for low- and medium-energy events. Recall that the baseline corresponds to a regression over the 12 lexical features at the municipality level, picking the maximum value detected for each earthquake, whereas our proposal applies spatial smoothing over highly supported regressors before picking the maximum. The results show that spatial smoothing is also useful for the maximum intensity detection task on low- and medium-energy events.
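The selection step described above can be sketched as follows, where `estimates` holds the smoothed municipality-level Mercalli values and `support` their local support. The support threshold and all names here are hypothetical placeholders for illustration, not values from the paper.

```python
def max_intensity(estimates, support, min_support=3):
    """Pick the maximum smoothed Mercalli estimate over municipalities whose
    local support (e.g., the number of geolocated messages backing the
    estimate) reaches a threshold; returns (None, None) if no municipality
    is sufficiently supported. The threshold value is illustrative only."""
    supported = {i: v for i, v in estimates.items()
                 if support.get(i, 0) >= min_support}
    if not supported:
        return None, None
    loc = max(supported, key=supported.get)
    return loc, supported[loc]
```

Filtering by local support before taking the maximum is what allows weak but genuinely local signals (such as the Limache event) to survive while isolated noisy estimates are discarded.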
4.3 Illustrative examples
Illustrative examples of spatial Mercalli reports comparing actual and predicted Mercalli intensities
Talca (3), Constitución (3),
Maule (4), Talca (3),
Maule (3), Navidad (2),
Coquimbo (4), Ovalle (4),
Coquimbo (3), Colina (3),
Melipilla (2), Santiago (2)
Til–Til (3), Santiago (3)
La Serena (5), Coquimbo (5),
La Serena (5), Coquimbo (5),
Vicuña (4), Ovalle (4),
Vicuña (4), Ovalle (4),
Quintero (6), Valparaíso (5),
Quintero (6), Valparaíso (6),
Quilpue (5), Quillota (5),
Viña del Mar (6), Quilpue (5),
Viña del Mar (4), Ovalle (3),
Quillota (5), San Felipe (4),
Santiago (4), La Serena (3)
Limache (7), Santiago (6),
Limache (6), Viña del Mar (6),
Viña del Mar (6), Valparaíso (6),
Valparaíso (6), Santiago (5),
Coquimbo (5), La Serena (5),
Coquimbo (5), La Serena (5),
Ovalle (5), Rancagua (5),
Rancagua (5), Curicó (5),
Curicó (4), Coronel (3)
Ovalle (4), Quirihue (3)
Table 11 shows that our method provides accurate spatial reports. The estimation of the maximum intensity is accurate, and the method also detects the epicenter; thus it can detect both the maximum intensity of the earthquake and the location at which it was registered. The use of spatial smoothing gives excellent results in terms of damage detection across the national territory. As expected, high-intensity earthquakes produce longer reports, showing an almost perfect match with the actual report. The use of reinforced Mercalli support helps detect medium- and low-energy events. Note that the intensity-3 earthquake located in Talca was successfully detected by our method, which also matched the localities close to where the event was perceived.
In this paper, we propose the first method for predicting the spatial distribution of Mercalli intensities for earthquakes using only social media features. Our literature review shows many efforts towards earthquake detection using social media, mostly targeting the location where an earthquake was felt and the maximum earthquake intensity. Our proposal performs well on both of these tasks and is competitive with the state of the art. However, these are not the main goals of our work. Our main objective is to predict the spatial distribution of Mercalli intensities without depending on geological models or on signals captured by spatially distributed seismographs . Our empirical evaluation shows that social media provides valuable spatial information, which is helpful for producing spatial intensity reports for earthquakes. In addition, we were successful in revealing local trends by using local-level high-support regressors. Our method uses a fine level of granularity in its spatial analysis (as opposed to prior approaches that use a coarser-grained analysis), which allows us to detect and provide reports for medium-energy and high-energy events. Our experimental results show that our estimated reports are almost identical to those produced by experts.
On the other hand, our approach is not without limitations. The main restriction on our proposed method is its dependency on the availability of spatially distributed social media data, specifically from Twitter. We think it is possible to generalize our method to use data from other social media platforms that contain textual messages and location information (provided that such data is available). However, our method cannot work without some type of social media content. Therefore, in geographic areas with little or no social media coverage, we will not have enough data to produce accurate estimates. A similar situation can occur in disasters where digital communications are interrupted and people cannot post on social media. This type of limitation is not exclusive to our system; it is a drawback of all crisis informatics systems that rely solely on social media as a data source. Nevertheless, we do not see social media dependency as a threat, given that our system is designed to provide social media information during crisis situations to enhance emergency response when possible.
Another limitation is the quality of message location estimation. Currently, our approach uses a fairly standard heuristic method to infer message geolocation, which is needed due to the lack of GPS data associated with Twitter usage in Chile (and in many other countries). This can induce noise in our location data; thus, by incorporating better geo-mapping techniques, our method could improve its accuracy. Recent work shows promising results in this direction , demonstrating that methods based on the user friendship network achieve high accuracy in geolocation inference. However, finding the best approach for location estimation is beyond the scope of our current work. On this note, we believe that our method can be scaled to other countries. This notion of scalability is supported by the fact that the existing literature provides ample evidence of the usefulness of Twitter for rapidly gaining situational awareness in different countries (e.g., Italy, New Zealand, Australia, the U.S., the U.K., and Japan). Furthermore, countries such as the U.K. show much wider adoption of GPS-enabled devices, which could actually improve the performance of our approach there.
In summary, in this article we have presented a method for the spatial inference of damages after an earthquake, reporting results on the Mercalli scale. Our contribution is a tool that allows for early response and improves coverage in locations where there are no expert observers. Gaining accurate situational awareness as soon as possible after a disaster is extremely valuable for emergency response agencies and governments.
Future work includes dealing with open issues, such as measuring and understanding the contribution of social media data in relation to other data sources such as seismograph recordings. We do not expect Twitter to outperform methods that include seismograph measurements, which can be extremely accurate, but rather aim to study how combining them with social media can enhance report immediacy and quality. To achieve this, one possibility would be to combine features from both sources, as was done by Burks et al. . Another open problem is studying our method's sensitivity to location estimation quality and to the keyword-based approach used to retrieve relevant tweets. These terms could induce message undersampling, since we could miss relevant tweets that do not include the selected terms. However, by including more keywords we also risk adding more noise to our dataset. The keywords that we currently use are the same as those used in , which have shown excellent recall for earthquakes in Chile. This is supported by Table 2, which shows that the selected terms achieve good coverage of all relevant events in the country in our dataset. Hence, we think that these terms offer a good trade-off between noise and relevant messages, given the linguistic variations found in Chile. Nevertheless, in other countries, where language varies more across regions, this could be an important limitation.
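For illustration, keyword-based retrieval reduces to a simple membership test. The keyword set below is hypothetical (common Spanish earthquake terms); the actual terms are those of the cited prior work and Table 2, which we do not reproduce here.

```python
# Hypothetical keyword set for illustration only; the actual tracked terms
# are those of the cited prior work (see Table 2).
KEYWORDS = {"temblor", "terremoto", "sismo"}

def is_relevant(tweet_text):
    """Keyword-based retrieval: keep a tweet if any tracked term appears.
    Adding terms raises recall but can also add noise, which is the
    trade-off discussed above."""
    words = tweet_text.lower().split()
    return any(k in words for k in KEYWORDS)
```

A whole-word membership test like this undersamples messages that describe shaking without using any tracked term, which is precisely the sensitivity we propose to study.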
Additionally, in the future we contemplate extending our work to incorporate more features. Time-based features extracted from the Twitter stream (e.g., tweet interval rate) are a valuable source of information for the earthquake detection task. We believe these types of features can also be helpful in the elaboration of spatial intensity reports. In addition, since the reports produced by our method are almost identical to those produced by experts, we plan to embed Twitter-based intensity estimations into the state-of-the-art earthquake detection and visualization system Twicalli .
“Did You Feel It?” website of the U.S. Geological Survey (USGS).
FuzzyWuzzy, a Python string matching library that uses the Levenshtein distance to compare string sequences: https://github.com/seatgeek/fuzzywuzzy (set to an 80% fuzzy confidence level).
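For illustration, an equivalent threshold check can be sketched with Python's standard library. Note that fuzzywuzzy computes a Levenshtein-based ratio, whereas difflib's SequenceMatcher uses a different similarity measure, so scores can differ slightly; this is an approximation, not the library's exact behavior.

```python
from difflib import SequenceMatcher

def fuzzy_match(a, b, threshold=0.80):
    """Approximate an 80% fuzzy-confidence check for place-name matching.
    difflib's ratio is not identical to fuzzywuzzy's Levenshtein-based
    score, but it serves the same accept/reject purpose here."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
```

Such a threshold lets accent variants and minor misspellings of municipality names (common in informal tweets) map to the same location.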
We thank Jazmine Maldonado from Inria Chile for her valuable help with our dataset curation. We also thank Hernan Sarmiento from the Universidad de Chile who helped us improve our coverage of the existing literature.
Availability of data and materials
Data and its description are available at https://doi.org/10.6084/m9.figshare.c.4206689.
MM and BP acknowledge funding support from the Millennium Institute for Foundational Research on Data. MM was partially funded by the project BASAL FB0821. The funder played no role in the design of this study.
All authors collaboratively designed and performed the research, contributed new analytic tools, analyzed data, and wrote the paper. All authors read and approved the final manuscript.
The authors declare that no conflicts of interest exist.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- Castillo C (2016) Big crisis data. Cambridge University Press, Cambridge
- Fajardo J, Yasumoto K, Shibata N, Sun W, Ito M (2014) Disaster information collection with opportunistic communication and message aggregation. J Inf Process 22(2):106–117
- Hughes AL, Palen L (2009) Twitter adoption and use in mass convergence and emergency events. Int J Emerg Manag 6(3/4):248–260
- Cameron MA, Power R, Robinson B, Yin J (2012) Emergency situation awareness from Twitter for crisis management. In: Proceedings of the 21st international conference on world wide web. WWW '12 companion. ACM, New York, pp 695–698
- Imran M, Castillo C, Diaz F, Vieweg S (2015) Processing social media messages in mass emergency: a survey. ACM Comput Surv 47(4):1–38
- Sakaki T, Okazaki M, Matsuo Y (2013) Tweet analysis for real-time event detection and earthquake reporting system development. IEEE Trans Knowl Data Eng 25(4):919–931
- Avvenuti M, Cresci S, Marchetti A, Meletti C, Tesconi M (2014) EARS (earthquake alert and report system): a real time decision support system for earthquake crisis management. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 1749–1758
- Maldonado J, Guzman J, Poblete B (2017) A lightweight and real-time worldwide earthquake detection and monitoring system based on citizen sensors. In: Fifth AAAI conference on human computation and crowdsourcing (HCOMP 2017). AAAI Press, Menlo Park, pp 137–146
- Cresci S, La Polla M, Marchetti A, Meletti C, Tesconi M (2014) Towards a timely prediction of earthquake intensity with social media. Technical report IIT TR-12/2014, Istituto di Informatica e Telematica, CNR
- Mendoza M, Poblete B, Valderrama I (2018) Early tracking of people's reaction in Twitter for fast reporting of damages in the Mercalli scale. In: Meiselwitz G (ed) Social computing and social media. Technologies and analytics. Springer, Berlin, pp 247–257
- Zhou A, Qian W, Ma H (2012) Social media data analysis for revealing collective behaviors. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining. KDD '12. ACM, New York, pp 1402–1402
- Earle P, Guy M, Buckmaster R, Ostrum C, Horvath S, Vaughan A (2010) OMG earthquake! Can Twitter improve earthquake response? Seismol Res Lett 81(2):246–251
- Palen L, Anderson KM (2016) Crisis informatics—new data for extraordinary times. Science 353(6296):224–225. https://doi.org/10.1126/science.aag2579
- Bagrow J, Wang D, Barabasi A (2011) Collective response of human populations to large-scale emergencies. PLoS ONE 6(3):e17680
- Mendoza M, Poblete B, Castillo C (2010) Twitter under crisis: can we trust what we RT? In: Proceedings of the first workshop on social media analytics. SOMA '10. ACM, New York, pp 71–79
- Castillo C, Mendoza M, Poblete B (2013) Predicting information credibility in time-sensitive social media. Internet Res 23(5):560–588
- De Longueville B, Smith RS, Luraschi G (2009) "OMG, from here, I can see the flames!": a use case of mining location based social networks to acquire spatio-temporal data on forest fires. In: Proceedings of the 2009 international workshop on location based social networks. LBSN '09. ACM, New York, pp 73–80
- Bauman K, Tuzhilin A, Zaczynski R (2017) Using social sensors for detecting emergency events: a case of power outages in the electrical utility industry. ACM Trans Manag Inf Syst 8(2–3):1–20
- Steinert-Threlkeld Z, Mocanu D, Vespignani A, Fowler J (2015) Online social networks and offline protest. EPJ Data Sci 4(19)
- Mukherjee T, Chander D, Eswaran S, Singh M, Varma P, Chugh A, Dasgupta K (2015) Janayuja: a people-centric platform to generate reliable and actionable insights for civic agencies. In: Proceedings of the 2015 annual symposium on computing for development. DEV '15. ACM, New York, pp 137–145
- Vieweg S, Hughes AL, Starbird K, Palen L (2010) Microblogging during two natural hazards events: what Twitter may contribute to situational awareness. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM, New York, pp 1079–1088
- Kryvasheyeu Y, Chen H, Obradovich N, Moro E, Van Hentenryck P, Fowler J, Cebrian M (2016) Rapid assessment of disaster damage using social media activity. Sci Adv 2(3). https://doi.org/10.1126/sciadv.1500779
- Li T, Zhou W, Zeng C, Wang Q, Zhou Q, Wang D, Xu J, Huang Y, Wang W, Zhang M, Luis S, Chen S-C, Rishe N (2016) DI-DAP: an efficient disaster information delivery and analysis platform in disaster management. In: Proceedings of the 25th ACM international conference on information and knowledge management. CIKM '16. ACM, New York, pp 1593–1602
- Rehman FU, Afyouni I, Lbath A, Basalamah S (2017) Understanding the spatio-temporal scope of multi-scale social events. In: Proceedings of the 1st ACM SIGSPATIAL workshop on analytics for local events and news. LENS '17. ACM, New York, pp 1–7
- Rosas E, Hidalgo N, Gil-Costa V, Bonacic C, Marin M, Senger H, Arantes L, Marcondes C, Marin O (2016) Survey on simulation for mobile ad-hoc communication for disaster scenarios. J Comput Sci Technol 31(2):326–349
- Yin J, Lampert A, Cameron M, Robinson B, Power R (2012) Using social media to enhance emergency situation awareness. IEEE Intell Syst 27(6):52–59
- Doan S, Vo B-K, Collier N (2011) An analysis of Twitter messages in the 2011 Tohoku earthquake. In: International conference on electronic healthcare. Springer, Berlin, pp 58–66
- Murakami A, Nasukawa T (2012) Tweeting about the Tsunami?: mining Twitter for information on the Tohoku earthquake and Tsunami. In: Proceedings of the 21st international conference on world wide web. WWW '12 companion. ACM, New York, pp 709–710
- Kropivnitskaya Y, Tiampo KF, Qin J, Bauer MA (2017) The predictive relationship between earthquake intensity and tweets rate for real-time ground-motion estimation. Seismol Res Lett 88(3):840–850
- Crooks A, Croitoru A, Stefanidis A, Radzikowski J (2013) #Earthquake: Twitter as a distributed sensor system. Trans GIS 17(1):124–147
- Robinson B, Power R, Cameron M (2013) A sensitive Twitter earthquake detector. In: Proceedings of the 22nd international conference on world wide web. ACM, New York, pp 999–1002
- Zhang X, Shasha D (2006) Better burst detection. In: Proceedings of the 22nd international conference on data engineering (ICDE '06). IEEE Press, New York, pp 146–146
- Sakaki T, Okazaki M, Matsuo Y (2010) Earthquake shakes Twitter users: real-time event detection by social sensors. In: Proceedings of the 19th international conference on world wide web. ACM, New York, pp 851–860
- D'Auria L, Convertito V (2016) Real-time mapping of earthquake perception areas in the Italian region from Twitter streams analysis. In: Earthquakes and their impact on society. Springer, Berlin, pp 619–630
- Unankard S, Li X, Sharaf MA (2015) Emerging event detection in social networks with location sensitivity. World Wide Web 18(5):1393–1417. https://doi.org/10.1007/s11280-014-0291-3
- Burks L, Miller M, Zadeh R (2014) Rapid estimate of ground shaking intensity by combining simple earthquake characteristics with tweets. In: 10th US national conference on earthquake engineering, frontiers of earthquake engineering, Anchorage
- Yin J, Karimi S, Lingad J (2014) Pinpointing locational focus in microblogs. In: Proceedings of the 2014 Australasian document computing symposium. ACM, New York, p 66
- Ribeiro S, Pappa GL (2018) Strategies for combining Twitter users geo-location methods. GeoInformatica 22(3):563–587
- Poblete B, Guzmán J, Maldonado J, Tobar F (2018) Robust detection of extreme events using Twitter: worldwide earthquake monitoring. IEEE Trans Multimed 20(10):2551–2561. https://doi.org/10.1109/TMM.2018.2855107