Breadth-first Search (BFS) is a fundamental graph problem. Due to the irregular nature of memory accesses to graph data structures, parallelization of BFS on cache-based systems leads to poor performance. Many issues, such as memory access latency, cache coherence policy, and inter-process synchronization, affect the throughput performance of BFS on such systems. In our proposed message-passing multi-softcore architecture, parallelization is achieved by exchanging information among autonomous softcores on FPGA. Several optimizations are performed to reduce the traffic on the interconnect and to enable designs with high clock rates. Implementations on a state of the art FPGA achieve clock rates in excess of 100 MHz. The sustained performance of our system ranges from 160 to 795 Million Edges Per Second on a DDR3 DRAM. This result approaches the upperbound set by the DRAM bandwidth, and it rivals the best performance from implementations on various multi-core computing platforms.
展开▼