Conference: International Joint Conference on Natural Language Processing

You don't have to think twice if you carefully tokenize

Abstract

Most tokenizers in current use only segment a text into tokens and group them into sentences. This is not how we think a tokenizer should work. We believe that a tokenizer should support the subsequent analysis components as well as it can. We present a tokenizer with a strong focus on transparency. First, the tokenizer's decisions are encoded in such a way that the original text can be reconstructed. This supports the identification of typical errors and, as a consequence, faster creation of better tokenizer versions. Second, all detected information that might be relevant for subsequent analysis components is made transparent by XML tags and special information codes for each token. Third, doubtful decisions are also marked by XML tags. This is very helpful for offline applications such as corpus building, where it seems more appropriate to spend a few minutes checking doubtful decisions manually than to work with incorrect data for years.
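The abstract gives no implementation details, so the following is only a minimal sketch of the two ideas it names: a tokenization from which the original text can be reconstructed, and XML markup that flags doubtful decisions. Everything here is an assumption made for illustration, not the authors' design: the `Piece` record, the `tokenize`, `to_xml`, and `from_xml` helpers, the `<tok>` and `<ws>` tags, the `doubtful` attribute, and the toy rule that treats a period after a short alphabetic token as doubtful.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Every character is whitespace, a word character, or a single other symbol,
# so the matches partition the input and nothing is dropped.
PIECE_RE = re.compile(r"\s+|\w+|[^\w\s]")


@dataclass
class Piece:
    """Hypothetical record: one token or one run of whitespace."""
    text: str
    is_space: bool
    doubtful: bool = False


def tokenize(text: str) -> list[Piece]:
    pieces = [Piece(m.group(), m.group().isspace())
              for m in PIECE_RE.finditer(text)]
    # Toy heuristic (an assumption): a period directly after a short
    # alphabetic token may end an abbreviation rather than a sentence,
    # so flag it for manual review instead of deciding silently.
    for prev, cur in zip(pieces, pieces[1:]):
        if cur.text == "." and prev.text.isalpha() and len(prev.text) <= 3:
            cur.doubtful = True
    return pieces


def to_xml(pieces: list[Piece]) -> str:
    root = ET.Element("text")
    for p in pieces:
        if p.is_space:
            # Whitespace is stored in an attribute so it survives serialization.
            ET.SubElement(root, "ws", {"v": p.text})
        else:
            tok = ET.SubElement(root, "tok")
            tok.text = p.text
            if p.doubtful:
                tok.set("doubtful", "yes")
    return ET.tostring(root, encoding="unicode")


def from_xml(xml: str) -> str:
    # Rebuild the original text from the markup alone.
    root = ET.fromstring(xml)
    return "".join(el.get("v") if el.tag == "ws" else el.text for el in root)


if __name__ == "__main__":
    sample = "Dr. Smith arrived at 5 p.m. today."
    markup = to_xml(tokenize(sample))
    print(markup)
    print(from_xml(markup) == sample)
```

Because whitespace is kept as explicit pieces rather than discarded, reconstruction is a plain concatenation, and a reviewer can search the markup for the `doubtful` attribute to check only the flagged decisions.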