Number of | Processing step |
---|---|
entries | |
22,002,522 | raw data |
21,995,183 | removed empty/defect IDs |
6,766,166 | removed impossible/idle entries (kept only entries with 1 < words_read < 1000, 1 < words_per_minute < 2000, and 1 < durations_seconds < 60⋅10) |
6,034,974 | removed entries before August 2019 and after May 2020 |
5,021,219 | removed entries with fewer than 15 or more than 600 words per minute (idling and skipping) |
4,689,763 | removed reading speed outliers according to 1 ⋅ interquartile range |
4,462,661 | removed words per minute outliers according to 1.5 ⋅ interquartile range |
4,218,061 | removed words read outliers according to 1.5 ⋅ interquartile range |
4,039,967 | removed seconds duration outliers according to 1 ⋅ interquartile range |
2,157,156 | after session detection, see Sect. 2.2 |
sessions | |
215,521 | total sessions detected, see Sect. 2.2 |
22,056 | for comparison with national test used in Sect. 3.1 (only users with more than 10 sessions, in grade 3, 5, or 7, and active from September to November 2019) |
209,952 | for improvement over time used in Sect. 3.2 (removed users that have fewer than four entries in four months) |
samples | session aggregates per user |
1,542 | for model to predict reading speed in Sect. 3.3 |
1,236 | training data used to fit the models |
308 | test data to evaluate the models |
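
The entry-level filters in the table can be illustrated with a minimal sketch. It assumes that "outliers according to k ⋅ interquartile range" means the conventional fences [Q1 − k⋅IQR, Q3 + k⋅IQR]; the table does not spell out the exact rule. The field names `words_read`, `words_per_minute`, and `durations_seconds` are taken from the table, while the helper names and the toy data are hypothetical.

```python
from statistics import quantiles


def within_plausible_range(entry):
    """Range check from the table: keep entries inside all three open intervals."""
    return (
        1 < entry["words_read"] < 1000
        and 1 < entry["words_per_minute"] < 2000
        and 1 < entry["durations_seconds"] < 60 * 10
    )


def remove_iqr_outliers(entries, key, k=1.5):
    """Keep entries whose `key` value lies in [Q1 - k*IQR, Q3 + k*IQR].

    Assumption: this fence definition is the conventional one and may
    differ from the rule actually used for the table.
    """
    values = [e[key] for e in entries]
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [e for e in entries if low <= e[key] <= high]


# Toy data (hypothetical): eight plausible entries plus one idle entry.
entries = [
    {"words_read": 120, "words_per_minute": wpm, "durations_seconds": 60}
    for wpm in (90, 100, 110, 120, 130, 140, 150, 580)
]
entries.append({"words_read": 0, "words_per_minute": 0, "durations_seconds": 700})

# Apply the steps in the same order as the table rows.
plausible = [e for e in entries if within_plausible_range(e)]
speed_ok = [e for e in plausible if 15 <= e["words_per_minute"] <= 600]
cleaned = remove_iqr_outliers(speed_ok, "words_per_minute", k=1.5)

print(len(entries), len(plausible), len(speed_ok), len(cleaned))
```

On the toy data, the idle entry fails the range check and the 580 words-per-minute entry, although inside the 15 to 600 window, falls outside the 1.5 ⋅ IQR fence. Passing `k=1` would mimic the stricter rows of the table under the same assumption.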