Enriching feature engineering for short text samples by language time series analysis

EPJ Data Science

Table 7 Parametrization of NSC Model. Optimal settings are typeset in bold

Process	Module	Parameters	Values
Preprocessing	None	N/A
	Raw (light preprocessing)	Remove words not appearing in all authors’ writings
	Preprocess (heavy preprocessing)	Remove words that do not appear in all authors’ writings, and words with a relative frequency below 0.05 percent
Representation	Word n-grams	N-gram range	Start = 1, End = 2
		Minimum term frequency	1 (use all terms)
		Maximum term frequency	1.0 (no limit)
Vectorize	Count vectorizer	All set to default
Classifier	Nearest Shrunken Centroids (NSC)	Shrink threshold	Tuned by GridSearchCV using 10-fold cross-validation
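
The settings in Table 7 can be illustrated with a minimal sketch. It assumes scikit-learn, which the table suggests only through the name GridSearchCV. Mapping "Count vectorizer" to CountVectorizer, NSC to NearestCentroid with a shrink threshold, and the minimum and maximum term frequency to min_df and max_df are assumptions. The threshold grid, the toy corpus and the helper names (to_dense, make_toy_corpus, build_nsc_search) are hypothetical, and the preprocessing variants from the table are not reproduced. The densify step is included because scikit-learn's NearestCentroid does not apply shrinkage to sparse input.

```python
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    """Convert the sparse count matrix to a dense array for NearestCentroid."""
    return X.toarray()


def make_toy_corpus(seed=0, n_per_author=20, n_words=12):
    """Hypothetical two-author corpus; each author favours different words."""
    rng = random.Random(seed)
    shared = ["the", "and", "but", "then", "was", "with"]
    favourites = {
        "author_a": ["sea", "ship", "storm", "harbour"],
        "author_b": ["garden", "rose", "bee", "orchard"],
    }
    texts, labels = [], []
    for author, words in favourites.items():
        pool = shared + words
        for _ in range(n_per_author):
            texts.append(" ".join(rng.choice(pool) for _ in range(n_words)))
            labels.append(author)
    return texts, labels


def build_nsc_search(thresholds=(0.1, 0.5, 1.0, 2.0)):
    """Word 1- to 2-gram counts followed by NSC, tuned with 10-fold CV."""
    pipeline = Pipeline([
        # N-gram range 1 to 2, no lower or upper frequency cut-off (Table 7).
        ("vectorize", CountVectorizer(ngram_range=(1, 2), min_df=1, max_df=1.0)),
        ("densify", FunctionTransformer(to_dense, accept_sparse=True)),
        ("nsc", NearestCentroid()),
    ])
    # The candidate thresholds are illustrative, not taken from the paper.
    param_grid = {"nsc__shrink_threshold": list(thresholds)}
    return GridSearchCV(pipeline, param_grid, cv=10)


texts, labels = make_toy_corpus()
search = build_nsc_search()
search.fit(texts, labels)

best_threshold = search.best_params_["nsc__shrink_threshold"]
predicted = search.predict(["the ship was with the storm and the sea"])
print(best_threshold, predicted[0])
```

With cv=10 and a classifier, GridSearchCV uses stratified folds, so each author needs at least ten samples; the toy corpus provides twenty per author for that reason.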