From: Sweet tweets! Evaluating a new approach for probability-based sampling of Twitter
Statistic/metric | Description and calculation |
---|---|
R, D, M, S, T | R refers to one of the five MSA regions; D refers to one of the 38 days in our Field Period; M refers to one of the six sampling methods; S refers to one of the four sample size settings (i.e., the number of queries); T refers to one of the eight topics: COVID, Social Distancing, Working, Masks, Sanitizing, General Virus, Symptoms, or Treatment. |
\(\tau _{RDMS} (q)\) | Total number of geo-filtered Tweets in the qth query of a sample of size S taken from region R on day D using method M. Note q = 1, 2, 3, …, S. |
\(\tau _{RDMS}^{T} ( q )\) | Total number of geo-filtered Tweets containing any of the keywords for topic T in the qth query of a sample of size S taken from region R on day D using method M. Note q = 1, 2, 3, …, S. |
\(\hat{I}_{RDMS}^{T}\) | Estimated incidence rate of topic T among all geo-filtered Tweets within the sample of size S taken from region R on day D using method M, computed as: \(\hat{I}_{RDMS}^{T} = {\sum_{q=1}^{S} \tau _{RDMS}^{T} ( q )} / {\sum_{q=1}^{S} \tau _{RDMS} (q)}\) |
\(I_{RD}^{T}\); \(F_{RD}^{T}\) | Incidence rate and frequency of topic T, respectively, among all Tweets in region R on day D, based on full Twitter corpus data accessed through a TFV. |
\(PRAB(I)_{RDMS}^{T}\); \(\boldsymbol{MPRAB}(\boldsymbol{I})_{\boldsymbol{RDMS}}\) | Percent Relative Absolute Bias for the incidence of topic T, based on geo-filtered Tweets from a sample of size S taken from region R on day D using method M, computed as: \(PRAB(I)_{RDMS}^{T} = 100\times \vert \frac{\hat{I}_{RDMS}^{T} - I_{RD}^{T}}{I_{RD}^{T}} \vert \). Mean Percent Relative Absolute Bias for the incidence is the average of the PRAB across all 8 topics derived from a sample of size S taken from region R on day D using method M, computed as: \(MPRAB(I)_{RDMS} = {\sum_{T} PRAB(I)_{RDMS}^{T}} / {8}\) |
\(\hat{F}_{RDMS}^{T}\) | Estimated frequency (number of Tweets) of topic T among all geo-filtered Tweets within the sample of size S taken from region R on day D using method M. For the SRS and VBEST methods it is computed as: \(\hat{F}_{RDMS}^{T} = N_{TPSUs} \times ( {\sum_{q=1}^{S} \tau _{RDMS}^{T} ( q )} / {S} )\). For all other methods it is computed as: \(\hat{F}_{RDMS}^{T} =86\text{,}400\times ( {\sum_{q=1}^{S} \tau _{RDMS}^{T} ( q )} / {\sum_{q=1}^{S} D_{RDMS} ( q )} )\), where \(N_{TPSUs}\) is the number of Tweet PSUs in the VBEST sampling frame, 86,400 is the number of seconds in a day, and \(\sum_{q=1}^{S} D_{RDMS} ( q )\) is the total duration (in seconds) required to gather all the Tweets (both geo-filtered and non-geo-filtered) across the S queries for the sample of size S taken from region R on day D using method M. |
\(PRAB(F)_{RDMS}^{T}\); \(\boldsymbol{MPRAB}(\boldsymbol{F})_{\boldsymbol{RDMS}}\) | Percent Relative Absolute Bias for Frequency of Topic T and Mean Percent Relative Absolute Bias for Topic Frequencies are derived in the same manner as for the incidence-based metrics described above except using \(F_{RD}^{T}\) as the reference value. |
\(\hat{N}_{RMDS}\); \(N_{RD}\) | Estimated and actual total number, respectively, of geo-filtered Tweets in region R on day D; the estimate is based on a sample of size S taken using method M, and the actual value is based on TFV data supplied by our vendor. For SRS and VBEST the estimated total is computed as: \(\hat{N}_{RMDS} = N_{TPSUs} \times ( {\sum_{q=1}^{S} \tau _{RDMS} (q)} / {S} )\). For all other methods it is computed as: \(\hat{N}_{RMDS} =86\text{,}400\times ( {\sum_{q=1}^{S} \tau _{RDMS} (q)} / {\sum_{q=1}^{S} D_{RDMS} ( q )} )\) |
\(\boldsymbol{PRAB}(\boldsymbol{N})_{\boldsymbol{RDMS}}\) | Percent Relative Absolute Bias for Total Tweets within region R on day D based on geo-filtered Tweets from a sample of size S taken using method M. This statistic is computed as: \(PRAB(N)_{RDMS} =100\times \vert \frac{\hat{N}_{RMDS} - N_{RD}}{N_{RD}} \vert \) |
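
The formulas in the table above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the example counts, durations, PSU total, and reference value are all hypothetical, and each function handles one fixed combination of region, day, method, and sample size.

```python
from statistics import mean

SECONDS_PER_DAY = 86_400  # scaling constant used by the duration-based estimators


def incidence_estimate(topic_counts, total_counts):
    """I-hat: topic Tweets over all geo-filtered Tweets, summed across the S queries."""
    return sum(topic_counts) / sum(total_counts)


def total_from_psus(counts, n_tpsus):
    """F-hat or N-hat for SRS and VBEST: mean count per query scaled by the number of Tweet PSUs."""
    return n_tpsus * sum(counts) / len(counts)  # len(counts) plays the role of S


def total_from_durations(counts, durations):
    """F-hat or N-hat for the other methods: Tweets per second scaled to a full day."""
    return SECONDS_PER_DAY * sum(counts) / sum(durations)


def prab(estimate, reference):
    """Percent Relative Absolute Bias of an estimate against its reference value."""
    return 100 * abs((estimate - reference) / reference)


def mprab(prab_by_topic):
    """Mean PRAB across topics (the table averages over all 8 topics)."""
    return mean(prab_by_topic.values())


if __name__ == "__main__":
    # Hypothetical sample of S = 3 queries for a single topic.
    topic_counts = [2, 3, 5]      # tau^T(q)
    total_counts = [20, 30, 50]   # tau(q)
    durations = [60, 60, 80]      # D(q), in seconds

    i_hat = incidence_estimate(topic_counts, total_counts)
    print("incidence estimate:", i_hat)
    print("PRAB(I) vs. a reference incidence of 0.08:", prab(i_hat, 0.08))
    print("frequency (PSU-based, 1000 PSUs):", total_from_psus(topic_counts, 1000))
    print("frequency (duration-based):", total_from_durations(topic_counts, durations))
```

Passing the per-query total counts \(\tau_{RDMS}(q)\) instead of the topic counts to either of the two scaling functions gives the corresponding estimate of total Tweets, which `prab` can then compare against \(N_{RD}\).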