Enriching feature engineering for short text samples by language time series analysis

Yichen Tang; Kelly Blincoe; Andreas W. Kempa-Liehr

Abstract

In this case study, we extend feature engineering approaches for short text samples by integrating techniques that were introduced in the context of time series classification and signal processing. The general idea of the presented approach is to tokenize the text samples under consideration and map each token to a number that measures a specific property of the token. Each text sample thereby becomes a language time series, generated from consecutively emitted tokens, in which time is represented by the position of the respective token within the text sample. The resulting language time series can be characterised by collections of established feature extraction algorithms from time series analysis and signal processing. This approach maps each text sample, irrespective of its original length, to 3970 stylometric features, which can be analysed with standard statistical learning methodologies. The proposed feature engineering technique for short text data is applied to two different corpora: the Federalist Papers data set and the Spooky Books data set. We demonstrate that the extracted language time series features can be successfully combined with standard machine learning approaches for natural language processing and have the potential to improve classification performance. Furthermore, the suggested feature engineering approach can be used to visualize differences and commonalities of stylometric features. The presented framework models systematic feature engineering based on approaches from time series classification and develops a statistical testing methodology for multi-class classification problems.
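The mapping from a text sample to a language time series can be illustrated with a minimal Python sketch. It makes several assumptions that are not taken from the abstract: the token property is the token length in characters, tokenization is a simple regular expression, and the handful of summary statistics below is only a small stand-in for the 3970 features mentioned above. The function names `language_time_series`, `lag1_autocorrelation`, and `extract_features` are hypothetical.

```python
import re
from statistics import mean, pstdev


def language_time_series(text):
    """Map each token to a number; the token position acts as the time index.

    Assumption: the measured token property is its length in characters.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    return [len(token) for token in tokens]


def lag1_autocorrelation(series):
    """Lag-1 autocorrelation of a series; 0.0 if it cannot be computed."""
    if len(series) < 2:
        return 0.0
    mu = mean(series)
    total_variation = sum((x - mu) ** 2 for x in series)
    if total_variation == 0:
        return 0.0
    covariation = sum((a - mu) * (b - mu) for a, b in zip(series, series[1:]))
    return covariation / total_variation


def extract_features(text):
    """Return a fixed-length feature dictionary, whatever the text length."""
    series = language_time_series(text)
    if not series:
        return {"n_tokens": 0, "mean": 0.0, "std": 0.0,
                "max": 0, "lag1_autocorr": 0.0}
    return {
        "n_tokens": len(series),
        "mean": mean(series),
        "std": pstdev(series),
        "max": max(series),
        "lag1_autocorr": lag1_autocorrelation(series),
    }


if __name__ == "__main__":
    print(extract_features("The quick brown fox"))
```

Because every sample is reduced to the same set of keys, samples of different lengths yield feature vectors of equal size, which is what allows them to be passed to standard classifiers.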
