Table 3 Number of entries, sessions, and samples used throughout this paper. Based on reading app data

From: Both sides of the story: comparing student-level data on reading performance from administrative registers to application generated data from a reading app

| Number of | Processing step |
|---|---|
| 22,002,522 | raw data |
| 21,995,183 | removed empty/defective IDs |
| 6,766,166 | removed impossible/idle entries (1000>words_read>1 and 2000>words_per_minute>1 and 6010>durations_seconds>1) |
| 6,034,974 | removed entries before August 2019 and after May 2020 |
| 5,021,219 | removed words per minute less than 15 and higher than 600 (idling and skipping) |
| 4,689,763 | removed reading speed outliers according to 1 interquartile range |
| 4,462,661 | removed words per minute outliers according to 1.5 interquartile range |
| 4,218,061 | removed words read outliers according to 1.5 interquartile range |
| 4,039,967 | removed seconds duration outliers according to 1 interquartile range |
| 2,157,156 | after session detection, see Sect. 2.2 |
| 215,521 | total sessions detected, see Sect. 2.2 |
| 22,056 | for comparison with national test used in Sect. 3.1 (only users: with more than 10 sessions; in grade 3, 5, and 7; active in September to November 2019) |
| 209,952 | for improvement over time used in Sect. 3.2 (removed users that have fewer than four entries in four months) |
| samples | session aggregates per user |
| 1542 | for model to predict reading speed in Sect. 3.3 |
| 1236 | training data used to fit the models |
| 308 | test data to evaluate the models |
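
The range check and the interquartile-range steps listed in the table can be illustrated with a minimal Python sketch. The field names `words_read`, `words_per_minute`, and `durations_seconds` are taken from the table; the function names are hypothetical. The table does not state how the interquartile-range limits were computed, so the sketch assumes the common convention of keeping values between Q1 - k·IQR and Q3 + k·IQR, with k = 1 or k = 1.5 as given in the respective row. It is not the authors' implementation.

```python
from statistics import quantiles


def in_plausible_range(entry):
    """Range check from the table: keep entries strictly inside the stated bounds."""
    return (
        1 < entry["words_read"] < 1000
        and 1 < entry["words_per_minute"] < 2000
        and 1 < entry["durations_seconds"] < 6010
    )


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) limits Q1 - k*IQR and Q3 + k*IQR.

    Assumption: this fence convention is not confirmed by the source.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_iqr_outliers(entries, field, k=1.5):
    """Drop entries whose `field` value lies outside the IQR limits."""
    lower, upper = iqr_bounds([e[field] for e in entries], k)
    return [e for e in entries if lower <= e[field] <= upper]
```

Under these assumptions, the steps would be chained in the order of the table, for example `remove_iqr_outliers(entries, "words_per_minute", k=1.5)` followed by `remove_iqr_outliers(entries, "words_read", k=1.5)`, with the limits recomputed on the remaining entries at each step.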