Venue: International Conference on Selected Areas in Cryptography

PhiRSA: Exploiting the Computing Power of Vector Instructions on Intel Xeon Phi for RSA


Abstract

Efficient implementations of public-key cryptographic algorithms on general-purpose computing devices facilitate the application of cryptography in communication security. Existing solutions pull in two different directions: implementations on GPUs achieve high throughput but suffer high latency, while those on CPUs offer low latency but low throughput. Intel Xeon Phi is the first highly parallel coprocessor of the Many Integrated Core (MIC) architecture, with up to 61 cores and one 512-bit Vector Processing Unit (VPU) in each core, which offers the potential to achieve both high throughput and low latency. In this paper, we propose a vector-oriented Montgomery multiplication design based on the vector carry propagation chain (VCPC) method to fully exploit the computing power of vector instructions on Intel Xeon Phi. Two key features of our design sharply reduce the number of instructions: (1) organizing the additions in Montgomery multiplication into four VCPCs, which saves the overhead of handling carry bits; (2) computing the intermediate scalar variable q in every round without breaking the flow of the VCPCs. Furthermore, we offer the optimal Montgomery multiplication implementation of our design on Intel Xeon Phi, which keeps the VPUs fully pipelined and maintains carry bits in vector mask registers. Based on the above, we implement RSA, named PhiRSA, and evaluate it on Intel Xeon Phi 7120P. For 1024-, 2048- and 4096-bit RSA, PhiRSA performs 258,370, 41,803 and 5,358 decryptions per second, with latencies of 0.94, 5.84 and 45.54 ms, respectively. These results are 4.1 to 8.5 times the performance of existing RSA implementations on Intel Xeon Phi. They exhibit high throughput comparable to that of GPU implementations but with far fewer parallel tasks, and low latency comparable to that of CPU implementations.
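For readers unfamiliar with the operation the abstract builds on, the following is a minimal scalar sketch of word-serial Montgomery multiplication, showing where the per-round scalar q comes from. It is not the paper's vectorized VCPC design: it uses Python big integers rather than 512-bit vector instructions and mask registers, and the function name `montgomery_multiply`, the parameter names, and the 32-bit word size are assumptions chosen for illustration only.

```python
def montgomery_multiply(a, b, n, num_words, word_bits=32):
    """Return a * b * R^-1 mod n, with R = 2**(word_bits * num_words).

    Illustrative scalar version (hypothetical helper, not the paper's code).
    Requires n odd and 0 <= a, b < n < R. Needs Python 3.8+ for pow(n, -1, m).
    """
    base = 1 << word_bits
    mask = base - 1
    # n_prime = -n^{-1} mod 2^word_bits, precomputed once per modulus.
    n_prime = (-pow(n, -1, base)) % base

    t = 0
    for i in range(num_words):
        # Take the i-th word of a and accumulate the partial product.
        a_i = (a >> (i * word_bits)) & mask
        t += a_i * b
        # Per-round scalar q: chosen so that t + q*n is divisible by 2^word_bits.
        q = ((t & mask) * n_prime) & mask
        t += q * n
        # The low word is now zero, so this shift is an exact division.
        t >>= word_bits

    # After the loop t < 2n, so one conditional subtraction suffices.
    if t >= n:
        t -= n
    return t


if __name__ == "__main__":
    n = (1 << 127) - 1            # an odd modulus, for demonstration only
    num_words = 4                 # R = 2^128 with 32-bit words
    R = 1 << (32 * num_words)
    a, b = 123456789, 987654321
    a_m, b_m = (a * R) % n, (b * R) % n          # convert to Montgomery form
    prod_m = montgomery_multiply(a_m, b_m, n, num_words)
    prod = montgomery_multiply(prod_m, 1, n, num_words)  # convert back
    print(prod == (a * b) % n)
```

Each loop iteration performs two multiply-accumulate additions and depends on q from the same round, which is why the abstract singles out both the carry handling of the additions and the computation of q as the places where a vectorized design can save instructions.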
