Tradeoff between Data-, Instruction-, and Thread-Level Parallelism in Stream Processors

8. REFERENCES

[1] J. Ahn, “Memory and Control Organizations of Stream Processors,” Ph.D. dissertation, Stanford University, 2007.

[2] J. Ahn, M. Erez, and W. J. Dally, “The Design Space of Data-Parallel Memory Systems,” in SC’06, Tampa, Florida, November 2006.

[3] J. Backus, “Can Programming be Liberated from the von Neumann Style?” Communications of the ACM, vol. 21, no. 8, pp. 613–641, August 1978.

[4] L. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese, “Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing,” in Proceedings of the 27th International Symposium on Computer Architecture, Vancouver, Canada, June 2000.

[5] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, “Brook for GPUs: stream computing on graphics hardware,” ACM Trans. Graph., vol. 23, no. 3, pp. 777–786, 2004.

[6] ClearSpeed, “CSX600 datasheet,” http://www.clearspeed.com/downloads/CSX600Processor.pdf, 2005.

[7] W. J. Dally, P. Hanrahan, M. Erez, T. J. Knight, F. Labonte, J. Ahn, N. Jayasena, U. J. Kapasi, A. Das, J. Gummaraju, and I. Buck, “Merrimac: Supercomputing with streams,” in SC’03, Phoenix, Arizona, November 2003.

[8] K. Diefendorff, P. K. Dubey, R. Hochsprung, and H. Scales, “Altivec extension to PowerPC accelerates media processing,” IEEE Micro, vol. 20, no. 2, pp. 85–95, 2000.

[9] ELPIDA Memory, Inc, “512M bits XDR™ DRAM,” http://www.elpida.com/pdfs/E0643E20.pdf.

[10] M. Erez, “Merrimac - High-Performance and Highly-Efficient Scientific Computing with Streams,” Ph.D. dissertation, Stanford University, 2006.

[11] M. Erez, J. Ahn, A. Garg, W. J. Dally, and E. Darve, “Analysis and Performance Results of a Molecular Modeling Application on Merrimac,” in SC’04, Pittsburgh, Pennsylvania, November 2004.

[12] L. Hammond, B. A. Hubbert, M. Siu, M. K. Prabhu, M. Chen, and K. Olukotun, “The Stanford Hydra CMP,” IEEE Micro, vol. 20, no. 2, pp. 71–84, 2000.

[13] M. S. Hrishikesh, D. Burger, N. P. Jouppi, S. W. Keckler, K. I. Farkas, and P. Shivakumar, “The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays,” in Proceedings of the 29th International Symposium on Computer Architecture, Anchorage, Alaska, 2002, pp. 14–24.

[14] N. Jayasena, M. Erez, J. Ahn, and W. J. Dally, “Stream Register Files with Indexed Access,” in Proceedings of the 10th International Symposium on High Performance Computer Architecture, Madrid, Spain, February 2004.

[15] U. J. Kapasi, “Conditional Techniques for Stream Processing Kernels,” Ph.D. dissertation, Stanford University, March 2004.

[16] U. J. Kapasi, S. Rixner, W. J. Dally, B. Khailany, J. Ahn, P. Mattson, and J. D. Owens, “Programmable Stream Processors,” IEEE Computer, pp. 54–62, August 2003.

[17] B. Khailany, W. J. Dally, S. Rixner, U. J. Kapasi, P. Mattson, J. Namkoong, J. D. Owens, B. Towles, and A. Chang, “Imagine: Media Processing with Streams,” IEEE Micro, pp. 35–46, March/April 2001.

[18] B. Khailany, W. J. Dally, S. Rixner, U. J. Kapasi, J. D. Owens, and B. Towles, “Exploring the VLSI Scalability of Stream Processors,” in Proceedings of the 9th Symposium on High Performance Computer Architecture, Anaheim, California, February 2003.

[19] C. Kozyrakis, et al., “Scalable Processors in the Billion-Transistor Era: IRAM,” Computer, vol. 30, no. 9, pp. 75–78, 1997.

[20] C. Kozyrakis and D. Patterson, “Overcoming the Limitations of Conventional Vector Processors,” in Proceedings of the 30th International Symposium on Computer Architecture, San Diego, California, June 2003, pp. 399–409.

[21] R. Krashinsky, C. Batten, M. Hampton, S. Gerding, B. Pharris, J. Casper, and K. Asanovic, “The Vector-Thread Architecture,” in Proceedings of the 31st International Symposium on Computer Architecture, Munich, Germany, June 2004, pp. 52–63.

[22] E. A. Lee and D. G. Messerschmitt, “Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing,” IEEE Transactions on Computers, January 1987.

[23] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. Dally, and M. Horowitz, “Smart Memories: A Modular Reconfigurable Architecture,” in Proceedings of the 27th International Symposium on Computer Architecture, Vancouver, BC, Canada, June 2000, pp. 161–171.

[24] P. Mattson, W. J. Dally, S. Rixner, U. J. Kapasi, and J. D. Owens, “Communication Scheduling,” in 9th International Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA, November 2000, pp. 82–92.

[25] S. Oberman, G. Favor, and F. Weber, “AMD 3DNow! Technology: Architecture and Implementations,” IEEE Micro, vol. 19, no. 2, pp. 37–48, 1999.

[26] S. Palacharla, N. P. Jouppi, and J. E. Smith, “Complexity-Effective Superscalar Processors,” in Proceedings of the 24th International Symposium on Computer Architecture, Denver, Colorado, 1997, pp. 206–218.

[27] Y. N. Patt, S. J. Patel, M. Evers, D. H. Friendly, and J. Stark, “One Billion Transistors, One Uniprocessor, One Chip,” Computer, vol. 30, no. 9, pp. 51–57, 1997.

[28] D. Pham, et al., “The Design and Implementation of a First-Generation CELL Processor,” in Proceedings of the IEEE International Solid-State Circuits Conference, San Francisco, CA, February 2005, pp. 184–185.

[29] I. E. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia. John Wiley & Sons, Ltd, 2003.

[30] S. Rixner, W. J. Dally, B. Khailany, P. Mattson, U. J. Kapasi, and J. D. Owens, “Register Organization for Media Processing,” in Proceedings of the 6th International Symposium on High Performance Computer Architecture, Toulouse, France, January 2000, pp. 375–386.

[31] E. Rotenberg, Q. Jacobson, Y. Sazeides, and J. Smith, “Trace Processors,” in Proceedings of the 30th ACM/IEEE International Symposium on Microarchitecture, North Carolina, 1997, pp. 138–148.

[32] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S. W. Keckler, and C. R. Moore, “Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture,” in Proceedings of the 30th International Symposium on Computer Architecture, San Diego, California, June 2003, pp. 422–433.

[33] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar, “Multiscalar Processors,” in Proceedings of the 22nd International Symposium on Computer Architecture, S. Margherita Ligure, Italy, 1995, pp. 414–425.

[34] S. T. Thakkar and T. Huff, “The Internet Streaming SIMD Extensions,” Intel Technology Journal, no. Q2, p. 8, May 1999.

[35] W. Thies, M. Karczmarek, and S. P. Amarasinghe, “StreamIt: A Language for Streaming Applications,” in Proceedings of the 11th International Conference on Compiler Construction, Grenoble, France, April 2002, pp. 179–196.

[36] D. M. Tullsen, S. J. Eggers, and H. M. Levy, “Simultaneous multithreading: Maximizing on-chip parallelism,” in Proceedings of the 22nd International Symposium on Computer Architecture, June 1995, pp. 392–403.

[37] E. Waingold, et al., “Baring it all to Software: Raw Machines,” Computer, vol. 30, no. 9, pp. 86–93, September 1997.
