Please wait a minute...
Tsinghua Science and Technology  2021, Vol. 26 Issue (5): 724-735    doi: 10.26599/TST.2020.9010035
Regular Articles     
Efficient Scheduling Mapping Algorithm for Row Parallel Coarse-Grained Reconfigurable Architecture
Naijin Chen(),Zhen Wang*(),Ruixiang He(),Jianhui Jiang(),Fei Cheng(),Chenghao Han()
School of Computer and Information Science, Anhui Polytechnic University, Wuhu 241000, China
School of Computer Science and Technology, Shanghai University of Electric Power, Shanghai 200090, China
School of Software Engineering, Tongji University, Shanghai 201804, China
Download: PDF (924 KB)      HTML
Export: BibTeX | EndNote (RIS)      

Abstract  

Row Parallel Coarse-Grained Reconfigurable Architecture (RPCGRA) has the advantages of maximum parallelism and programmable flexibility. Designing an efficient algorithm to map the diverse applications onto RPCGRA is difficult due to a number of RPCGRA hardware constraints. To solve this problem, the nodes of the data flow graph must be partitioned and scheduled onto the RPCGRA. In this paper, we present a Depth-First Greedy Mapping (DFGM) algorithm that simultaneously considers the communication costs and the use times of the Reconfigurable Cell Array (RCA). Compared with level breadth mapping, the performance of DFGM is better. The percentage of maximum improvement in the use times of RCA is 33% and the percentage of maximum improvement in non-original input and output times is 64.4% (Given Discrete Cosine Transfor 8 (DCT8), and the area of reconfigurable processing unit is 56). Compared with level-based depth mapping, DFGM also obtains the lowest averages of use times of RCA, non-original input and output times, and the reconfigurable time.



Key wordstemporal mapping      Reconfigurable Cell Array (RCA)      listed scheduling      communication costs     
Received: 30 July 2020      Published: 30 April 2021
Fund:  Natural Science Foundation of Anhui Province(1808085MF203);National Natural Science Foundation of China(61432017)
Corresponding Authors: Zhen Wang     E-mail: chennaijin@ict.ac.cn;wangzhenqq@hotmail.com;809031856@qq.com;jhjiang@tongji.edu.cn;957189105@qq.com;1048551181@qq.com
About author: Naijin Chen received the PhD degree in computer science and technology from Tongji University, Shanghai, China in 2013. He obtained postdoctoral certificate from Tianjin University, Tianjin, China in 2016. He is a member of China Computer Federation. He is currently a professor in Anhui Polytechnic University, Wuhu, China. His current research interests include reconfigurable computing and compiling, fault tolerant computing, reliability evaluation of high-level circuits, approximate computing, formal verification, semantic big data representation and reasoning, and pattern recognition and image processing.|Zhen Wang received the PhD degree in computer science and technology from Tongji University in 2008. She ever worked as a senior engineer in Synopsys from 2008 to 2013. She is a member of China Computer Federation. She is now working in Shanghai University of Electric Power. Her main research interests include fault tolerant computing, reliability evaluation of high-level circuits, and approximate computing.|Ruixiang He received the MS degree from Anhui Polytechnic University, Wuhu, China in 2018. He is currently an engineer in Paneng Electric Power Technology Co. Ltd, Nanjing, China. His current research interests include reconfigurable computing and compiling, fault tolerant computing, reliability evaluation of high-level circuits, and approximate computing.|Jianhui Jiang received the PhD degree in traffic information engineering and control from Shanghai Tiedao University (in April 2000, it was merged to Tongji University) in 1999. Since 2011, he has been the associate dean of the School of Software Engineering, Tongji University. He is a professor and PhD supervisor in Tongji University. He is a senior member of China Computer Federation. His main research interests include reconfigurable computing and compiling, dependable systems and networks, software reliability engineering, and VLSI test and fault tolerance.|Fei Cheng received the BS degree from Anhui Polytechnic University, Wuhu, China in 2019. He is now a master student at School of Computer and Information Science, Anhui Polytechnic University, Wuhu, China. His current research interests include reconfigurable computing and compiling, formal verification, fault tolerant computing, semantic big data representation and reasoning, and pattern recognition and image processing.|Chenghao Han received the BS degree from Suzhou University, Suzhou, China in 2020. He is now a master student at School of Computer and Information Science, Anhui Polytechnic University, Wuhu, China. His current research interests include reconfigurable computing and compiling, formal verification, and fault tolerant computing.
Cite this article:

Naijin Chen,Zhen Wang,Ruixiang He,Jianhui Jiang,Fei Cheng,Chenghao Han. Efficient Scheduling Mapping Algorithm for Row Parallel Coarse-Grained Reconfigurable Architecture. Tsinghua Science and Technology, 2021, 26(5): 724-735.

URL:

http://tst.tsinghuajournals.com/10.26599/TST.2020.9010035     OR     http://tst.tsinghuajournals.com/Y2021/V26/I5/724

Fig.?1 General RPCGRA architecture.
Fig.?2 Illustration of misplacement, direct cross-level, and interlaced modes mapped in one RCA.
Fig.?3 Comparison of DFGs.
AlgorithmParameter
MN1N2SSD (clock cycle)CCON (clock cycle)TTOTAL (clock cycle)
LBM4272617116159.5
LBDM4221916116152.5
DFGM316151499128.5
Table?1 Mapping parameter comparison of LBM, LBDM, and DFGM.
BenchmarkNumber of operations
Totaladdsubmulloglesslelsrlslasrmodxor
FEAL34600000040420
DCT89040163400000000
FFT83612121200000000
EWF620416803600000000
MATRIX41124806400000000
PHLMC4336128488032016032000
HHLFLIC59617213221202000362400
PHLMC86242249616000320486400
HHLFLC648152212136024012605200
DCT325621921291290001120000
Table?2 Set of benchmarks.
Fig.?4 Two kinds of PEA interconnection.
Fig.?5 Sub-DFG of DCT32.
StructureParameter
MN1N2SSD (clock cycle)CCON (clock cycle)TTOTAL (clock cycle)
Grid PEA22233949
Row router PEA10022229
Table?3 Mapping parameter comparison of grid PEA and row router PEA.
BenchmarkMN1N2
LBMDFGM%LBMDFGM%LBMDFGM%
RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8
FEAL32320011101110001110111000
DCT84332–25.0–33.357452316–59.6–64.457452316–59.6–64.4
FFT843430017161716001716171600
EWF67665–14.3–16.7141121116104–17.7–14.01161089494–19.0–13.0
MATRIX44333–25.0080705051–37.5–27.180705051–37.5–27.0
PHLMC49887–11.1–12.5209177142145–32.1–18.1201167140140–30.3–16.2
HHLFLIC16131512–6.3–7.7547517314309–42.6–40.0537506301300–43.9–40.7
PHLMC816131512–6.3–7.7578516293298–49.3–42.2529484286296–45.9–38.8
HHLFLC191519140–6.7560547345340–38.4–37.8558545341340–38.9–37.6
DCT3217141512–11.8–14.3352276204181–42.0–34.4334266194179–41.9–32.7
Average %0000–14.3–14.10000–39.9–34.80000–39.6–33.8
Table?5 Comparison of DFGM and LBM (M, N1, and N2).
BenchmarkSSDCCONTTOTAL
LBMDFGM%LBMDFGM%LBMDFGM%
RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8
FEAL29242924008568856800142.5119.5142.5119.500
DCT825232318–4.0–21.7158141141124–10.8–12.1293262240211–18.1–19.5
FFT81091090010487104870014512614512600
EWF649394234–14.3–12.8323306306289–5.3–5.6575.5534.5528497–8.3–7.0
MATRIX427232726013.0180163163163–9.40359328312312–13.1–4.9
PHLMC4696070651.48.3489472472455–3.5–3.6931872851830.5–8.6–4.8
HHLFLIC1381291441274.3–1.6868817851800–2.0–2.11808.5171815631492–13.6–13.2
PHLMC81231051341178.911.4896845879828–1.9–2.01868.5174615981538–14.5–11.9
HHLFLC1481391521332.7–4.39719039718860–1.9197218821745.51653–11.5–12.2
DCT32153142149133–2.6–6.3851800817766–4–4.31660152614781392–11.0–8.8
Average %0000–0.5–1.80000–5.7–4.50000–12.3–10.3
Table?6 Comparison of DFGM and LBM (S𝐒𝐃, C𝐂𝐎𝐍, and T𝐓𝐎𝐓𝐀𝐋).
BenchmarkMN1N2
LBDMDFGM%LBDMDFGM%LBDMDFGM%
RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8
FEAL323200111011100011411400
DCT833320–33.028202316–18.0–20.028202316–18.0–20.0
FFT843430017161716001716171600
EWF66565001161081161040–3.7949694940–2.1
MATRIX433330057535051–12.0–3.857535051–12–3.8
PHLMC49787–11.00148147142145–4.1–1.4146145140140–4.1–3.4
HHLFLIC15121512003103053143091.31.3309308301300–2.6–2.6
PHLMC8151215120029329629329800.7295294286296–3.10.7
HHLFLC181519145.6–6.7354353345340–2.5–3.7349352341340–2.3–3.4
DCT3216131512–6.3–7.7209190204181–2.4–4.7202183194179–4.0–2.2
Average %0000–3.9–16.00000–6.3–4.40000–6.6–4.6
Table?7 Comparison of DFGM and LBDM (M, N1, and N2).
BenchmarkSSDCCONTTOTAL
LBDMDFGM%LBDMDFGM%LBDMDFGM%
RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8RCA6×7RCA7×8
FEAL29242924008568856800142.5119.5142.5119.500
DCT8232223180–18.21411411411240–12.1245236240211–2.0–10.1
FFT8108109012.510487104870014512514512600.8
EWF645394234–6.7–12.830628930628900531505528497–0.6–1.6
MATRIX42327272617.4–3.716316316316300315315312312–1.0–1.0
PHLMC474627065–5.74.8489455472455–3.50878831851830.5–3.1–0.1
HHLFLIC1421281441271.4–0.88518008518000015631495156314920–0.2
PHLMC81281161341174.70.98798288798280015971535159815380.10.2
HHLFLC1461431521334.1–7.09549039718861.8–1.917601692.51745.51653–0.8–2.3
DCT32153141149133–2.6–5.7834783817766–2.0–2.21505.51423.514781392–1.8–2.2
Average %00001.8–3.30000–1.2–5.40000–1.3–1.8
Table?8 Comparison of DFGM and LBDM (S𝐒𝐃, C𝐂𝐎𝐍, and T𝐓𝐎𝐓𝐀𝐋).
[1]   Cardoso J. M. P., Diniz P. C., and Weinhardt M., Compiling for reconfigurable computing: A survey, ACM Computing Surveys, vol. 42, no. 4, pp. 1301–1365, 2010.
[2]   Yoon J. W., Shrivastava A., Park S., Ahn M., and Paek Y., A graph drawing based spatial algorithm for coarse-grained reconfigurable architectures, IEEE Transactions on Very Large Scale Integration Systems, vol. 17, no. 11, pp. 1565–1578, 2009.
[3]   Berekovic M., Kanstein A., Mei B., and Sutter B. D., Mapping of nomadic multimedia applications on the ADRES reconfigurable array processor, Microprocessors and Microsystems, vol. 33, no. 4, pp. 290–294, 2009.
[4]   Ferreira R. S., Cardoso J. M. P., Damiany A., Vendramini J., and Teixeira T., Fast placement and routing by extending coarse grained reconfigurable arrays with Omega networks, Journal of Systems Architecture, vol. 57, no. 8, pp. 761–777, 2011.
[5]   Krishnamoorthy R., Varadarajan K., and Nandy S. K., Interconnect-topology independent mapping algorithm for a coarse grained reconfigurable architecture, in Proc. of 2011 International Conference on Field Programmable Technology, New Delhi, India, 2011, pp. 1–5.
[6]   Ahn M., Yoon J. W., Paek Y., Kim Y., Kiemb M., and Choi K., A spatial mapping algorithm for heterogeneous coarse grained reconfigurable architectures, in Proc. of the Conference on Design, Automation and Test in Europe, Munich, Germany, 2006, pp. 363–368.
[7]   Ansaloni G., Tanimura K., Pozzi L., and Dutt N., Integrated kernel partitioning and scheduling for coarse grained reconfigurable arrays, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 12, pp. 1803–1816, 2012.
[8]   Lee G., Choi K., and Dutt N. D., Mapping multi-domain applications onto coarse grained reconfigurable architectures, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 30, no. 5, pp. 637–650, 2011.
[9]   Jo M., Lee D., Han K., and Choi K., Design of a coarse-grained reconfigurable architecture with floating-point support and comparative study, Integration, the VLSI Journal, vol. 47, no. 2, pp. 232–241, 2014.
[10]   Kim W., Choi Y., and park H., Fast modulo scheduler utilizing patternized routes for coarse-grained reconfigurable architectures, ACM Transactions on Architecture and Code Optimization, vol. 10, no. 4, pp. 1–24, 2013.
[11]   Chen N. J. and Jiang J. H., Mapping algorithm for coarse-grained reconfigurable multimedia architectures, in Proc. of 19th IEEE International Parallel & Distributed Processing Symposium (IPDPS) Workshop, Shanghai, China, 2012, pp. 281–286.
[12]   Yu S. D., Research on the software/hardware co-design for reconfigurable processor, PhD dissertation, School of Information Science and Technology, Tsinghua University, Beijing, China, 2009.
[13]   Singh H., Lee M.-H., Lu G., Kurdahi F. J., and Filho E. M. C., MorphoSys: An integrated reconfigurable system for data parallel and computation intensive applications, IEEE Transactions on Computers, vol. 49, no. 5, pp. 465–481, 2000.
[14]   Chen N. J., Jiang J. H., Chen X., Zhou Z., and Xu Y., An improved level partitioning algorithm considering minimum execution delay and resource restraints, Acta Electronica Sinica, vol. 40, no. 5, pp. 1055–1066, 2012.
[15]   Xiao J., Shi Z. H., Zhu W. D., Jiang J. H., Zhou Q. W., Lou J., Huang Y., Ji Q., and Sun Z., Uniform non-Bernoulli sequences oriented locating method for reliability-critical gates, Tsinghua Science and Technology, vol. 26, no. 1, pp. 24–35, 2021.
[16]   Ouyang Y. M., Wang Q., Li Z., Liang H. G., and Li J., Fault-tolerant design for data efficient retransmission in WiNoC, Tsinghua Science and Technology, vol. 26, no.1, pp. 85–94, 2021.
[17]   Sangyun O., Hongsik L., and Jongeun L., Efficient execution of stream graphs on coarse-grained reconfigurable architectures, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 12, pp. 1978–1988, 2017.
[18]   Bae I., Harris B., Min H., and Egger B., Auto-tuning CNNs for coarse-grained reconfigurable array-based accelerators, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, pp. 2301–2310, 2018.
[19]   Gu J. Y., Yin S. Y., liu L. B., and Wei S. J., Stress-aware loops mapping on CGRAs with dynamic multi-map reconfiguration, IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 9, pp. 2105–2120, 2018.
No related articles found!