Sweet tweets! Evaluating a new approach for probability-based sampling of Twitter

Buskirk, Trent D.; Blakely, Brian P.; Eck, Adam; McGrath, Richard; Singh, Ravinder; Yu, Youzhi

doi:10.1140/epjds/s13688-022-00321-1

EPJ Data Science

Table A1.3 Calculation of the metrics we will use to evaluate the methods in our experiment. Bold-faced metrics represent the key outcomes of interest


Statistic/metric^#	Description and calculation
R, D, M, S, T	R refers to one of the five MSA regions; D refers to one of the 38 days in our Field Period; M refers to one of the 6 sampling methods; S refers to one of the four sample size settings (i.e., the number of queries); T refers to one of the 8 topics: COVID, Social Distancing, Working, Masks, Sanitizing, General Virus, Symptoms or Treatment.
\(\tau _{RDMS} (q)\)	Total number of geo-filtered Tweets in the qth query of a sample of size S taken from Region R on Day D using Method M. Note q = 1,2,3,…,S.
\(\tau _{RDMS}^{T} ( q )\)	Total number of geo-filtered Tweets containing any of the keywords for topic T in the qth query of a sample of size S taken from Region R on Day D using Method M. Note q = 1,2,3,…,S.
\(\hat{I}_{RDMS}^{T}\)	Estimated Incidence rate of Topic T among all geo-filtered Tweets within the sample of size S taken from region R on day D using method M, computed as: \(\hat{I}_{RDMS}^{T} = {\sum_{q=1}^{S} \tau _{RDMS}^{T} ( q )} / {\sum_{q=1}^{S} \tau _{RDMS} (q)}\)
\(I_{RD}^{T}\); \(F_{RD}^{T}\)	Incidence rate and Frequency of topic T, respectively, among all Tweets in region R on day D based on full Twitter corpus data accessed through a TFV.
\(PRAB(I)_{RDMS}^{T}\); \(\boldsymbol{MPRAB}(\boldsymbol{I})_{\boldsymbol{RDMS}}\)	Percent Relative Absolute Bias for the Incidence of topic T based on geo-filtered Tweets from a sample of size S taken from region R on day D using method M, computed as: \(PRAB(I)_{RDMS}^{T} = 100\times \vert \frac{\hat{I}_{RDMS}^{T} - I_{RD}^{T}}{I_{RD}^{T}} \vert \). Mean Percent Relative Absolute Bias for the Incidence is the average of the PRAB across all 8 Topics derived from a sample of size S taken from Region R on day D using method M, computed as: \(MPRAB(I)_{RDMS} = {\sum_{T} PRAB(I)_{RDMS}^{T}} / {8}\)
\(\hat{F}_{RDMS}^{T}\)	Estimated Frequency (number of Tweets) of topic T among all geo-filtered Tweets within the sample of size S taken from region R on day D using method M. For the SRS and VBEST methods it is computed as: \(\hat{F}_{RDMS}^{T} = N_{TPSUs} \times ( {\sum_{q=1}^{S} \tau _{RDMS}^{T} ( q )} / {S} )\) and for all other methods as: \(\hat{F}_{RDMS}^{T} =86\text{,}400\times ( {\sum_{q=1}^{S} \tau _{RDMS}^{T} ( q )} / {\sum_{q=1}^{S} D_{RDMS} ( q )} )\) where \(N_{TPSUs}\) is the number of Tweet PSUs in the VBEST sampling frame and \(\sum_{q=1}^{S} D_{RDMS} ( q )\) is the total amount of time in seconds (i.e., the duration) required to gather all the Tweets (both geo-filtered and non-geo-filtered) across the S queries for the sample of size S taken from region R on day D using method M.
\(PRAB(F)_{RDMS}^{T}\); \(\boldsymbol{MPRAB}(\boldsymbol{F})_{\boldsymbol{RDMS}}\)	Percent Relative Absolute Bias for Frequency of Topic T and Mean Percent Relative Absolute Bias for Topic Frequencies are derived in the same manner as for the incidence-based metrics described above except using \(F_{RD}^{T}\) as the reference value.
\(\hat{N}_{RDMS}\); \(N_{RD}\)	Estimated and actual total number of geo-filtered Tweets in region R on day D; the estimate is based on a sample of size S taken using method M. The actual value is based on TFV data supplied by our vendor. The estimated totals are computed for SRS and VBEST as: \(\hat{N}_{RDMS} = N_{TPSUs} \times ( {\sum_{q=1}^{S} \tau _{RDMS} (q)} / {S} )\) and for all other methods as: \(\hat{N}_{RDMS} =86\text{,}400\times ( {\sum_{q=1}^{S} \tau _{RDMS} (q)} / {\sum_{q=1}^{S} D_{RDMS} ( q )} )\)
\(\boldsymbol{PRAB}(\boldsymbol{N})_{\boldsymbol{RDMS}}\)	Percent Relative Absolute Bias for Total Tweets within region R on day D based on geo-filtered Tweets from a sample of size S taken using method M. This statistic is computed as: \(PRAB(N)_{RDMS} =100\times \vert \frac{\hat{N}_{RDMS} - N_{RD}}{N_{RD}} \vert \)
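
The formulas in the table can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and all counts, durations, reference values and the number of Tweet PSUs in the example are made up for demonstration.

```python
SECONDS_PER_DAY = 86_400


def estimated_incidence(topic_counts, total_counts):
    """I-hat: topic Tweets over all geo-filtered Tweets, pooled over the S queries."""
    return sum(topic_counts) / sum(total_counts)


def estimated_total_psu(counts, n_tpsus):
    """SRS/VBEST estimator: number of Tweet PSUs times the mean count per query."""
    return n_tpsus * sum(counts) / len(counts)


def estimated_total_rate(counts, durations):
    """Estimator for the other methods: Tweets per second, scaled to one day."""
    return SECONDS_PER_DAY * sum(counts) / sum(durations)


def prab(estimate, reference):
    """Percent Relative Absolute Bias of an estimate against its reference value."""
    return 100.0 * abs((estimate - reference) / reference)


def mprab(estimates, references):
    """Mean PRAB across topics; both arguments are dicts keyed by topic.

    With all 8 topics present, the divisor equals the 8 used in the table.
    """
    return sum(prab(estimates[t], references[t]) for t in references) / len(references)


# Illustrative sample of S = 3 queries (made-up numbers).
topic_counts = [2, 3, 5]      # tau^T(q): geo-filtered Tweets matching topic T
total_counts = [20, 30, 50]   # tau(q): all geo-filtered Tweets
durations = [60, 60, 60]      # D(q): seconds needed to gather each query

i_hat = estimated_incidence(topic_counts, total_counts)
n_hat_psu = estimated_total_psu(total_counts, n_tpsus=1440)   # 1440 PSUs is an assumption
n_hat_rate = estimated_total_rate(total_counts, durations)
bias_i = prab(i_hat, reference=0.08)                          # 0.08 is a made-up reference
```

Passing the topic counts instead of the total counts to either total estimator gives the corresponding topic Frequency estimate, since the two formulas differ only in which count is summed.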

^#Note: when we aggregate the MPRAB(I), MPRAB(F) and PRAB(N) metrics over all days of the experiment for a given combination of region, method and size, we refer to the resulting measure as the overall average metric value.
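
The aggregation described in the note can be sketched as a group-by over days. This is a hedged illustration that assumes the overall average is the simple mean of the daily values; the record layout, the function name `overall_average` and the region labels are hypothetical, and the metric values are made up.

```python
from collections import defaultdict
from statistics import mean


def overall_average(records):
    """Average a daily metric over days for each (region, method, size).

    records: iterable of (region, day, method, size, value) tuples, where
    value is a daily MPRAB(I), MPRAB(F) or PRAB(N).
    """
    grouped = defaultdict(list)
    for region, day, method, size, value in records:
        grouped[(region, method, size)].append(value)
    return {key: mean(values) for key, values in grouped.items()}


# Made-up daily PRAB(N) values for two hypothetical regions.
records = [
    ("Region A", 1, "SRS", 10, 10.0),
    ("Region A", 2, "SRS", 10, 20.0),
    ("Region B", 1, "SRS", 10, 5.0),
]
averages = overall_average(records)
```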
