# A Paper in LaTeX

Here is a paper published at a conference. It is formatted using a two-column format accepted by both ACM and IEEE conferences.

Note that this paper does not use a separate bibliography file.

## The Main Document

\documentclass[10pt]{article} \usepackage{latexcad} \usepackage{setspace} \setlength{\textheight}{22cm} \setlength{\textwidth}{17cm} \topmargin 283mm \advance \topmargin -\textheight \divide \topmargin by 4 \advance \topmargin -1.0in \leftmargin 216mm \advance \leftmargin -\textwidth \divide \leftmargin by 2 \advance \leftmargin -1.0in \oddsidemargin \leftmargin \evensidemargin \leftmargin \newtheorem{teorema}{Theorem} \newtheorem{exemplu}{Example} \newtheorem{definitie}{Definition} \sloppy \twocolumn %\doublespacing \begin{document} %\title{{\bf An Integral Parallel Architecture for Embedded Computation}} \title{{\bf {\em Not Multi-, but Many-Core}:\\Designing Integral Parallel Architectures for Embedded Computation}} \author{Mihaela Mali\c ta\footnote{St. Anselm College, {\tt mmalita@anselm.edu}}, Dominique Thi\' ebaut\footnote{Smith College, {\tt thiebaut@cs.smith.edu}}, Gheorghe \c Stefan\footnote{BrightScale Inc., {\tt gstefan@brightscale.com}}} \date{Paper submission deadline: 4th of May, 2007} \maketitle % http://www.arch.cs.titech.ac.jp/ALPS/ \begin{abstract} The recent switch to fully programmable parallel architectures in the embedded-system domain must be accompanied by appropriate solutions allowing the efficient implementation of all the corner cases that usually occur in real applications. We introduce the {\em integral parallel architecture} (IPA) as the best way to currently support intensive data computation in System-on-a-Chip (SoC) implementations, with {\em small area} \& {\em low power}. An IPA supports all three possible styles of parallelism: {\em data}, {\em time}, and {\em speculative}. As an illustrative example, we present the {\bf {\em BA1024}} chip, a fully programmable SoC designed by {\em BrightScale, Inc.} for HDTV codecs. Its main performance figures include $60 \, GOPS/Watt$ and $2 \, GOPS/mm^{2}$, illustrating an efficient IPA approach in embedded computation. \end{abstract} %\twocolumn \section{Introduction} Technology is currently going through several important evolutions, identified in particular by Borkar \cite{Borkar05} and Asanovic \cite{Asanovic06}: \begin{itemize} \item the slow-down in the rate at which clock speeds increase, \item the switch from pure standard functionality to more specific functionality in video, graphics, and performance-hungry applications requiring full programmability, and \item the replacement of Application Specific Integrated Circuits (ASICs) by programmable Systems-on-a-Chip (SoCs) due to development costs and increasing technological difficulties. \end{itemize} Borkar and Asanovic propose several new approaches to computer architectures to respond to these changes and curb their effects: \begin{itemize} \item application-domain-oriented architectures, in two versions: {\bf many}- or {\bf multi}-processors, or \item computation-type-oriented architectures, such as the Dwarf approach in Asanovic's paper \cite{Asanovic06}. \end{itemize} Intel's Recognition, Mining and Synthesis (RMS) white paper is a good example of the first type of architecture \cite{Dubey05}. In this paper we propose two solutions to address some of the limitations imposed by the current technology shifts: \begin{itemize} \item an optimized approach for {\em low-power \& small-area} embedded computation in SoCs, and \item a way to remove some of the limitations emphasized by Asanovic \cite{Asanovic06} as the 13th Dwarf, referring to automaton-style computation. \end{itemize} The validity of our solutions rests on two hypotheses.
One is that programmable SoCs can compete with ASICs only if a fully programmable parallel architecture is used, because a circuit is an intrinsically parallel system. The other is that the computational model of partial recursive functions \cite{Kleene36} must be able to treat equally well both circuits and parallel programmable systems. Our approach naturally leads to two main results: \begin{itemize} \item the definition of an IPA\footnote{The reader is invited to see our approach as being different from the {\em heterogeneous computing systems}, which comprise a range of diverse computing resources that can be local to one another or geographically distributed.} for intensive computations in embedded systems, and \item the proposal of a more nuanced taxonomy of parallel computation, as opposed to the more structural and functional approach first introduced by Flynn \cite{Flynn72}. \end{itemize} Both are exemplified by the {\em BA1024}, which we believe is the first embodiment of an IPA for the HDTV market, and which we use as an illustrative example in this paper. Parallel computation is becoming ever more ubiquitous, and manifests itself in two extreme forms, one in {\em complex} computation and the other in {\em intensive} data-parallel computation. Our paper deals with the second form. The remainder of this paper is structured as follows. In Section 2 we introduce different views and trends of the parallel computing landscape as seen by researchers at Intel and at Berkeley. In Section 3 we present partial recursiveness and show how it can be used to identify different parallel constructs. This leads us in Section 4 to the definition of Integral Parallel Architectures and the two forms in which they appear. Section 5 presents a state-of-the-art circuit for video decoding and shows how it embodies the different types of parallelism we have presented. Section 6 concludes with some remarks about current designs and the trends we feel designs should follow in order to match the performance levels required by tomorrow's applications. \section{Comments on Intel's RMS \& Berkeley's Dwarf Approaches} Intel's approach distinguishes between the {\em multi}-core era and the {\bf many}-core era, and between scalar, parallel, and massively parallel applications. For embedded systems, a many-core approach seems to be the solution, because SoCs implemented in ASIC technologies must be replaced by fully programmable SoCs. In this case, the functional granularity in ASICs must have a counterpart in the granularity of the parallel programmable system used in high-performance embedded applications. Many-core approaches also present low-power solutions, where simple processing elements (PEs) are fully utilized in each clock cycle to perform either simple functions, one per clock cycle, or complex functions composed of simpler ones. In his paper describing the landscape of parallel computing research as seen from Berkeley, Asanovic \cite{Asanovic06} makes the following statement: \begin{quote} {\em We argue that these two ends of the computing spectrum {\em [embedded and high performance computing]} have more in common looking forward than they did in the past. First, both are concerned with power, $\dots$. Second, both are concerned with hardware utilization, $\dots$.
Third, $\dots$ the importance of software reuse must increase.} \end{quote} The architecture we propose here is mainly oriented to solve the first two issues -- {\em low power \& hardware utilization} -- using many simple programmable, though not reconfigurable, elements. To define this architecture, we start from the computational model which has a very direct circuit or computing-network implementation: the {\em partial recursive function} model, which we present in the next section. \section{From Partial Recursiveness to Parallel Computation} We claim that the most suggestive classic computational model for defining parallel architectures is the model of partial recursive functions \cite{Brainerd74}, because the rules defining it have a direct correspondence with circuits, the intrinsic parallel support for computation. \subsection{Composition \& basic parallel structures} The first rule of composition provides the basic parallel structures to be used in defining all the forms of parallelism. Assume we have $m$ $n$-ary functions $h_{i}(x_{0}, \ldots x_{n-1})$, for $i= 0, 1, \ldots m-1$, along with an $m$-ary function $g(y_{0}, \ldots y_{m-1})$. We can define the composition rule by combining them as follows: $$f(x_{0}, \ldots x_{n-1}) = g(h_{0}(x_{0}, \ldots x_{n-1}), \ldots h_{m-1}(x_{0}, \ldots x_{n-1}))$$ The physical structure associated with this concept, containing simple circuits or simple programmable machines, is illustrated in Figure \ref{composition}. \placedrawing[h]{draw1_1.lp} {{\bf The physical structure associated to the composition rule.} {\small The composition of the function $g$ with the functions $h_{0}, \ldots, h_{m-1}$ implies a two-level system. The first level, performing $m$ computations in {\bf parallel}, is {\bf serially} connected with the second level, which performs a reduction function.}}{composition} This suggests the following four separate and meaningful forms of parallelism, as illustrated in Figure \ref{comp}: \begin{enumerate} \item {\bf data parallel composition}: when $n=m$, each function $h_{i} = h$ depends on a single input variable $x_{i}$, for $i=0, 1, \ldots n-1$, and $g$ performs the identity function (see Figure \ref{comp}a). Given an input {\bf vector} containing $n$ scalars: $$\{{\bf X}\} = \{x_{0}, x_{1}, \ldots, x_{n-1}\}$$ the result is another vector: $$\{h(x_{0}), h(x_{1}), \ldots, h(x_{n-1})\}.$$ \placedrawing[h]{draw1_2.lp} {{\bf The four simple forms of composition.} {\small {\bf a.} Data parallel composition. {\bf b.} Speculative composition. {\bf c.} Serial composition. {\bf d.} Reduction composition.}}{comp} \item {\bf speculative composition}: when $n=1$, i.e. $x_{0} = x$ (see Figure \ref{comp}b), and $g$ performs the identity function. In other words, it computes a vector of functions $${\bf H} = [h_{0}(x), \ldots h_{n-1}(x)]$$ applied on the same {\bf scalar} input $x$, generating a vector of results: $${\bf H}(x) = \{h_{0}(x), h_{1}(x), \ldots, h_{n-1}(x)\}.$$ \item {\bf serial composition}: This corresponds to the case $n = m = 1$, as shown in Figure \ref{comp}c. Here, a {\em pipe} of different machines receives a {\bf stream} of $n$ scalars as input: $$<{\bf X}> = <x_{0}, x_{1}, \ldots, x_{n-1}>$$ and provides another stream of scalars $$<f(x_{0}), f(x_{1}), \ldots, f(x_{n-1})>.$$ In the general case, the function $f(x)$ is a composition of more than two functions $h$ and $g$.
Thus, the function $f$ can be expressed as a vector of functions ${\bf F}$ receiving as input a data stream ${\bf <X>}$: $${\bf F(<X>)} = [f_{0}(x), \ldots f_{p-1}(x)].$$ In Figure \ref{comp}c, ${\bf F}(<X>) = [h(x), g(x)]$. \item {\bf reduction composition}: In this last case, each $h_{i}$ performs the identity function, as illustrated in Figure \ref{comp}d. The input vector $\{x_{0}, \ldots, x_{n-1}\}$ is applied to the block computing the function $g$, which reduces it to the scalar $g(x_{0}, \ldots x_{n-1})$; a stream of vectors is thus transformed into a stream of scalars. \end{enumerate} The composition rule provides the context for defining computation using the following basic concepts: \begin{description} \item[scalar]: $x$ \item[vector]: $\{{\bf X}\} = \{x_{0}, x_{1}, \ldots, x_{n-1}\}$ \item[stream]: $<{\bf X}> = <x_{0}, x_{1}, \ldots, x_{n-1}>$ \item[function]: $f(x)$ \item[vector of functions]: \begin{itemize} \item ${\bf F(<X>)} = [f_{0}(x), \ldots f_{p-1}(x)]$ applied on streams \item ${\bf F(x)} = [f_{0}(x), \ldots f_{p-1}(x)]$ applied on scalars. \end{itemize} \end{description} We now have all that is required to define the concepts of primitive recursion and of minimalization in our context. \subsection{Primitive recursive rule} The primitive recursive rule computes the function $f(x,y)$ as follows: $$f(x,y) = h(x, f(x, y-1))$$ where $f(x,0) = g(x)$. This rule can be translated into the following serial composition: $$f(x,y) = h(x, h(x, h(x, \ldots h(x, g(x)) \ldots ))).$$ There are two ways to implement the primitive recursive rule in parallel. In both cases we assume that a large amount of data is available for the computation, in the form of vectors or streams of data ready for input into a primitive recursive function. If the function $f(x,y)$ is to be computed for the vector of scalars $\{{\bf Y}\} = \{y_{0}, y_{1}, \ldots, y_{n-1}\}$, then a data-parallel structure is used. In this case each machine computes the function $f(x,y_{i})$ using a local {\em data loop}. The resulting computation takes $\max(y_{0}, y_{1}, \ldots, y_{n-1})$ cycles. If, on the other hand, the function $f(x,y)$ is to be computed for a stream of scalars, then a time-parallel structure is used. A pipe of $n$ machines is selected. The pipe receives in each cycle a new scalar from the stream. In cases where $y>n$, a {\em data loop} can be added, connecting the output of the pipe back to its input. \subsection{Minimalization} The minimalization rule assumes a function $f(x)$ defined as follows: $$f(x) = min(y)[m(x,y)=0]$$ The value of $f(x)$ is the minimum $y$ for which $m(x,y)=0$. Here as well, two different parallel solutions are possible for the minimalization problem: one uses data-parallel structures, while the other uses time-parallel structures. The first implementation of the minimalization function is the brute-force approach, and uses the speculative structure represented in Figure \ref{comp}b, where each block computes a function which returns a pair containing a predicate and a scalar: $$h_{i} = \{(m(x,i)=0), i\}$$ after which a reduction step (using a structure belonging to the class represented in Figure \ref{comp}d) selects the smallest $i$ from all pairs having the form $\{1, i\}$, if any, that were generated on the previous speculative composition level. All pairs of the form $\{0, i\}$ are ignored. The second implementation of the minimalization operation shows up in time-parallel environments, where speculation can be used to speed up pipeline processing.
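The brute-force implementation of minimalization is easy to capture in code. The following sketch is ours, in plain C: sequential loops merely simulate what the speculative level of Figure \ref{comp}b and the reduction level of Figure \ref{comp}d compute in parallel, and the function $m$ and all names are illustrative only.

\begin{verbatim}
/* Sketch of speculative minimalization:
   compute m(x,i) for i = 0 .. N-1 (done in
   parallel by N blocks; simulated here by a
   loop), then reduce by selecting the
   smallest i whose predicate is true. */
#include <stdio.h>

#define N 16    /* number of h_i blocks */

/* illustrative m(x,y); any m works here */
static int m(int x, int y) { return x - y*y; }

int min_y(int x) {
    int pred[N];
    /* speculative level: pairs (pred, i) */
    for (int i = 0; i < N; i++)
        pred[i] = (m(x, i) == 0);
    /* reduction level: smallest i with
       pred[i] == 1; pairs {0, i} ignored */
    for (int i = 0; i < N; i++)
        if (pred[i]) return i;
    return -1;  /* no such y below N */
}

int main(void) {
    printf("%d\n", min_y(9)); /* prints 3 */
    return 0;
}
\end{verbatim}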
{\bf Reconfigurable pipes} can be conceived and implemented using special reduction features distributed along a pipe. We formalize this concept now. We define a function pipe as the function vector: $${\bf P} = [f_{0}(x), \ldots f_{p-1}(x)]$$ where $y_{i} = f_{i}(x)$, for $i = 0, \ldots p-1$. The associated reconfigurable pipe transforms the original pipe characterized by: $${\bf P} = [\ldots f_{i}(y_{i-1}), \ldots]$$ into a pipe characterized by: $${\bf P} = [\ldots f_{i}(y_{i-1}, \ldots y_{i-s}), \ldots]$$ where $f_{i}(y_{i-1}, \ldots y_{i-s})$ is a function or a program which decides in each step, or cycle, which variable to use in the current computation, selecting\footnote{Selection is one of the simplest reduction functions.} one of the $\{y_{i-1}, \ldots y_{i-s}\}$ variables with the help of an $s$-input selector. The maximum {\em degree of speculation} in this case is the index $s$. \subsection{Examples} We now present two typical examples of minimalization, and show how speculative composition can be used to solve them and accelerate the computation. \begin{exemplu} Let $f(x,y)$ be a function $f: {\bf N}^{2} \rightarrow {\bf N}$, where ${\bf N}$ denotes the set of positive integers. For $x=a$, the problem is to find all the successive positive values $i$ and $i+1$ for which $sign(f(a,i)) \neq sign(f(a,i+1))$ (if any), i.e., the consecutive indices at which $f(a, \cdot)$ changes sign. \end{exemplu} One way of solving this problem is with a pool of machines performing speculative composition, as illustrated in Figure \ref{comp}b, where $h_{i} = f(a,i)$, followed by an appropriate reduction composition, illustrated in Figure \ref{comp}d. $\diamond$ \begin{exemplu} Consider a pipeline in which we must compute the following conditional expression in one of the middle stages: \begin{center} {\tt z = z[n-1] ? z + (x-c) : z + (x+c);} \end{center} where {\tt z[n-1]} is the sign bit of {\tt z} and {\tt c} is a constant value. \end{exemplu} To maximize speed, the pipeline must evaluate $x-c$ and $x+c$ in parallel, but only one of the resulting values is fed to the next stage. The physical structure of the pipe must be dynamically reconfigured in order to accommodate the use of a 2-PE speculative composition array, as illustrated in Figure \ref{comp}b. If the variable $x$ is computed in Stage $i$: $$x=y_{i}=f_{i}(\ldots)$$ then Stage $i+1$ computes $$x-c=y_{i+1}=f(y_{i})$$ Stage $i+2$ computes $$x+c=y_{i+2}=f(y_{i})$$ and Stage $i+3$ computes $$z=y_{i+3}=f(y_{i+1}, y_{i+2}).$$ The pipe is configured as a ``cross'', using the stages $i+1$ and $i+2$ as a speculative ``bar'' across the pipe. The degree of speculation in this case is $s=2$. $\diamond$ In order to implement speculative execution, there is no need for any special features in a data-parallel environment. But for the time-parallel environment, simple reduction features, or selection functions\footnote{This recursive function is called {\em projection} in the theory of partial recursive functions.}, must be added in each pipeline stage. This results in a {\em two-dimensional $n \times s$ pipe}, i.e. an $n$-stage pipe where each stage is able to select its input from the output of one of the previous $s$ stages. This two-dimensional pipe has a maximum depth of $n$ and a maximum degree of speculation of $s$. In conclusion, we see that the minimalization rule can be implemented using all the simple parallel resources, each one the embodiment of a simple form of composition.
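The behavior of the reconfigured segment of Example 2 can be sketched in plain C (our illustration, not generated code): stages $i+1$ and $i+2$ produce the two speculative values, and stage $i+3$ plays the role of the 2-input selector, giving $s=2$.

\begin{verbatim}
/* Sketch of the s = 2 speculative segment of
   Example 2. In hardware the three stages run
   concurrently on successive data; here a
   single activation is shown. */
#include <stdint.h>

static int16_t segment(int16_t z, int16_t x,
                       int16_t c) {
    int16_t y1 = x - c;  /* stage i+1 */
    int16_t y2 = x + c;  /* stage i+2 */
    /* stage i+3: 2-input selector driven by
       the sign bit z[n-1], i.e. z < 0 for a
       16-bit two's complement z */
    return (z < 0) ? z + y1 : z + y2;
}
\end{verbatim}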
Next we use the different forms of composition introduced above to propose a new taxonomy of parallel architectures. \subsection{Functional Taxonomy of Parallel Computing} The previously identified simple forms of composition, all summarized in Figure \ref{comp}, yield a functional taxonomy of parallel computation which intersects with Flynn's original taxonomy, and which we categorize as follows: \begin{description} \item[data-parallel computing]: this form uses operators that take vectors as arguments and return vectors, scalars (by reduction operations), or streams (input values for time-parallel computations). This form is very similar to Flynn's SIMD machine \cite{Flynn72}. \item[time-parallel computing]: this form uses operators that take streams as arguments and return streams, scalars, or vectors (which can be used as input values for data-parallel computations). This form is akin to MIMD machines. \item[speculative-parallel computing]: in this form, operators take scalars as arguments and return vectors reduced to scalars using selection. This form is used mainly to speed up time-parallel computations, and it contains a true MISD-like structure. \end{description} Thus, by positioning ourselves in the context of the partial recursive functions model, we argue that we can devise a functional taxonomy of parallel computing which better encompasses today's architectures, because it uses {\em functions and variables} rather than instructions and data. Because we are now in a new theoretical framework, we must define what an architecture is in this context, and also the machines that implement this architecture. In the next section we present the concept of {\em Integral Parallel Architecture}, or IPA, as a parallel architecture featuring a multitude of the above-mentioned parallel forms. \section{Integral Parallel Architectures \& Low-Power Embedded Computation} In today's technological arena, high-performance \& low-power embedded systems compete with ASICs. We believe that the only way to compete with the ASIC approach is to reduce the granularity of the programmable computing elements to the level of the circuits performing the application-oriented functions. To this end, one must rely on many small processing elements, rather than on a few large processors. \subsection{Integral Parallel Architecture: A Definition} Two opposite, extreme versions of IPAs exist today: \begin{itemize} \item {\bf complex IPA}s, with all types of parallel mechanisms tightly {\em interleaved}, and \item {\bf intensive IPA}s, with parallel mechanisms that are highly {\em segregated}. \end{itemize} Current high-performance processors have a complex IPA, because they implement all types of parallelism in a highly integrated fashion; they are examples of the first type of IPA. Indeed, a pipelined super-scalar machine with speculative execution performs {\em on the same structure} -- a super-speculative pipe -- time-parallel computations, data-parallel computations, and speculative-parallel computations. An intensive IPA, on the other hand, is implemented using distinct hardware support for the simplified, special cases of composition previously emphasized. Because demanding applications cannot be implemented using a single type of parallelism, intensive IPAs are the only solution for building fully programmable, low-power, high-performance systems-on-a-chip.
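In code terms, the three parallel forms that an IPA must support correspond to three operator shapes, which an intensive IPA realizes in segregated hardware. The following plain-C sketch is our illustration only; sequential loops stand in for parallel hardware, and all names are ours.

\begin{verbatim}
/* Operator shapes of the functional taxonomy
   (illustrative; loops simulate hardware). */
#include <stdint.h>
typedef int16_t scalar;
typedef scalar (*fn)(scalar);

/* data-parallel: vector -> vector */
void map(const scalar *x, scalar *y, int n,
         fn h) {
    for (int i = 0; i < n; i++)
        y[i] = h(x[i]);
}

/* time-parallel: one stream element through
   a pipe F = [f_0 ... f_{p-1}] */
scalar pipe(scalar x, fn *f, int p) {
    for (int i = 0; i < p; i++) x = f[i](x);
    return x;
}

/* speculative: scalar -> vector of candidates,
   reduced back to a scalar by selection */
scalar speculate(scalar x, fn *h, int s,
                 int sel) {
    scalar cand[s];
    for (int i = 0; i < s; i++)
        cand[i] = h[i](x);
    return cand[sel];
}
\end{verbatim}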
We continue our exploration of the concept of intensive IPAs by looking at their implementations. \subsection{Intensive IPAs} Actual embedded applications demand intensive data-parallel computation and/or intensive time-parallel computation, often requiring the support of speculative resources. Therefore, the machine implementation of an IPA must provide resources for both. We coin such a machine an {\bf integral parallel machine}, or IPM, and observe that there are two ways of designing one: \begin{itemize} \item with a unique physical support for both data- and time-parallel computation, which we refer to as a {\bf monolithic IPM}, or \item with two distinct structures for the two types of parallelism, which we refer to as a {\bf segregated IPM}. \end{itemize} We now briefly present each type of machine. \subsubsection{Monolithic integral parallel machine} The main structure of this IPM type is a linear array of processing elements (PEs) that operates in two modes: as a data-parallel machine receiving instructions from a controller, or as a pipeline of PEs, each executing its own locally stored program \cite{Mitu05}, \cite{Thiebaut07}. Note that the performance requirements of today's applications will dictate the addition of speculative parallelism to this array of PEs. However, the simplest form of this type of IPM is a linear data-parallel array of $n$ bidirectionally connected PEs, adorned with constant-time selection mechanisms ($s$-selectors) that permit $s$-degree speculative execution in pipeline mode. \subsubsection{Segregated integral parallel machine} Segregated integral parallel machines, on the other hand, are chosen when adding the speculative resources becomes too expensive for a very long pipe. If the number of stages $n$ in the pipe is large (hundreds or thousands) and the degree of speculation $s$ is larger than 2, the additional hardware is justified only if these very long pipes are mandated by the overall design. In all other cases, a short pipe with few stages and a reasonably large $s$ can be added as a separate element of the overall design. In cases where the amount of data parallelism exceeds the amount of time parallelism present, a segregated IPM is preferable. \section{Case Study: {\em The BA1024 Chip}} In this section we present a chip currently manufactured by {\em BrightScale, Inc.}, the {\bf {\em BA1024}}, whose core is a segregated IPM with parameters $n=1024$, $m=8$, and $s=4$. The chip is dimensioned to fully support the computation required for dual HDTV codecs. The chip is a combination of many-core and multi-processor architectures: it contains an array of 1024 PEs, alongside an array of four MIPS processors working as a MIMD machine.
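To convey the flavor of programming such a segregated IPM, consider the dot product reported in Table 1. The sketch below is a plain-C simulation of the idea only: one multiplication per PE in lock-step, followed by a $\log_{2} 1024 = 10$-level reduction tree. It is not BrightScale code and implies nothing about the real chip's API or cycle count.

\begin{verbatim}
/* Plain-C simulation (not BrightScale code)
   of a dot product on a 1024-PE array: each
   PE multiplies its pair of 16-bit components
   in lock-step, then a reduction tree sums
   the partial products in 10 levels. */
#include <stdint.h>
#define NPE 1024

int32_t dot(const int16_t a[NPE],
            const int16_t b[NPE]) {
    int32_t p[NPE];
    /* data-parallel phase: all PEs at once */
    for (int i = 0; i < NPE; i++)
        p[i] = (int32_t)a[i] * b[i];
    /* reduction-tree phase: pairwise sums */
    for (int s = NPE / 2; s > 0; s /= 2)
        for (int i = 0; i < s; i++)
            p[i] += p[i + s];
    return p[0];
}
\end{verbatim}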
The chip is implemented in a $130nm$ standard process, runs at 200MHz, and features: \begin{itemize} \item audio and video interfaces for two HD channels, \item four MIPS processors used as: host processor, video processor, demultiplexing processor, and audio processor, \item a DDR interface sustaining data transfer rates of $3.2GB/sec$, \item an interconnection fabric allowing transfers of 128-bit words in each clock cycle, and \item an accelerator with an intensive IPA containing: \begin{itemize} \item 1024 16-bit PEs, each having an integer unit, a boolean unit, and 256 16-bit words of data memory, \item a global loop structure, used mainly to identify the first PE having the selected predicate set to true, \item a reduction tree, used to extract data from the array, or to count the number of PEs with a predicate set to true, \item a bidirectional I/O tree, which transfers data to and from the array; the transfer is strictly parallel and concurrent with the main computational process, and \item 8 16-bit PEs, each with its own program memory. \end{itemize} \end{itemize} Because codec implementation is data intensive, the data-parallel section is very heavily used. The time-parallel section is needed to accelerate purely sequential codec modules, mainly for the H.264 standard. \subsection{Performance} The performance measures of the BA1024 are summarized in Tables 1 and 2. The dot product operation in Table 1 refers to vectors having 1024 16-bit integer components. Note also that the 200 GOPS figure does not include floating point operations, which are emulated in software; in that case 2GFLOPS + 100GOPS can be sustained. The DCT operations in Table 2 are performed on $8 \times 8$ arrays. \begin{table}[h!p] \label{perftable} \begin{center} \begin{tabular}{|l|c|} \hline {\bf Operation} & {\bf Performance} \\ \hline 16-bit operations & 200 GOPS \\ \hline Dot product & 28 clock cycles\\ & $>$ 7GDotProduct/sec \\ \hline External bandwidth & 3.2GB/sec \\ \hline Internal bandwidth & 400GB/sec \\ \hline OPS/power & $>$ 60GOPS/Watt \\ \hline OPS/area & $>$ 2GOPS/$mm^{2}$ \\ \hline \end{tabular} \caption{Overall performance data for the BA1024.} \end{center} \end{table} \begin{table}[h!p] \label{videotable} \begin{center} \begin{tabular}{|l|c|} \hline {\bf Operation} & {\bf Performance} \\ \hline DCT & 0.15 cycles/pixel \\ \hline SAD & 0.0025 cycles/pixel \\ \hline decoding of a dual & 85\% utilization \\ H.264 HD stream & of the accelerator \\ \hline \end{tabular} \caption{Video performance data for the BA1024.} \end{center} \end{table} \section{Concluding Remarks} We conclude by making the following remarks. \paragraph{Ubiquitous parallelism.} Parallelism is an integral part of today's medium- and high-performance processors. It is exploited more and more in the two IPA variants that we have presented: complex IPAs, where parallelism appears in general-purpose processors, and intensive IPAs, such as specialized accelerators. Chances are that the term ``parallel'' may become obsolete in the near future, as technological limitations are removed and all computations end up implementing parallelism in a ``natural'' way. \paragraph{Complex versus intensive IPA.} There is a significant difference between the performance measures of standard processors using complex IPAs and those of machines making use of intensive IPAs, of which the BrightScale BA1024 is an example.
Typical values for today's complex IPAs are: \begin{itemize} \item 4GIPS + 4GFLOPS \item (0.04GIPS + 0.04GFLOPS)/Watt \item (0.016GIPS + 0.016GFLOPS)/$mm^{2}$ \end{itemize} where an instruction is a 32-bit operation. For intensive IPAs, however, the measures are: \begin{itemize} \item 200GOPS {\bf or} 2GFLOPS+100GOPS \item 60GOPS/Watt \item 2GOPS/$mm^{2}$ \end{itemize} Note that our goal in presenting the BA1024 is solely to emphasize the significant differences between complex and intensive IPAs, and by no means do we claim any superiority of the BrightScale design over that of other companies. \paragraph{Segregation of the complex IPA from the intensive IPA.} Segregation of IPAs is the best solution for optimizing both {\em price (area) versus performance} and {\em power versus performance}: \begin{center} {\bf High Performance Architecture = complex IPA + intensive IPA.} \end{center} Almost all demanding applications would benefit from the use of such a computing architecture. We claim that in the case of such an architecture, {\bf the complex part must be strongly segregated from the intensive part} in order to reach the target performance at a competitive price, and with a minimum amount of dissipated energy. \paragraph{Acknowledgments} We would like to thank Emanuele Altieri, Lazar Bivolarski, Frank Ho, Bogdan M\^{\i}\c tu, Tom Thomson, and Dan Tomescu for insightful discussions on the material presented here. \begin{thebibliography}{nnn} \bibitem{Asanovic06} Krste Asanovic et al.: {\em The Landscape of Parallel Computing Research: A View from Berkeley}, Technical Report No. UCB/EECS-2006-183, University of California, Berkeley, December 18, 2006. \bibitem{Borkar05} Shekhar Y. Borkar et al.: {\em Platform 2015: Intel Processor and Platform Evolution for the Next Decade}, Intel Corporation, 2005. \bibitem{Dubey05} Pradeep Dubey: {\em A Platform 2015 Workload Model: Recognition, Mining and Synthesis Moves Computers to the Era of Tera}, Intel Corporation, 2005. \bibitem{Flynn72} M. J. Flynn: ``Some Computer Organizations and Their Effectiveness'', {\em IEEE Trans. Comp.} C-21:9 (Sept. 1972), pp. 948-960. \bibitem{Kleene36} Stephen C. Kleene: ``General Recursive Functions of Natural Numbers'', {\em Math. Ann.}, 112, 1936. \bibitem{Malita06} Mihaela Mali\c ta, Gheorghe \c Stefan, Marius Stoian: ``Complex vs. Intensive in Parallel Computation'', in {\em International Multi-Conference on Computing in the Global Information Technology - Challenges for the Next Generation of IT\&C - ICCGI 2006}, Bucharest, Romania, August 1-3, 2006. \bibitem{Mitu05} Bogdan M\^{\i}\c tu: private communication. \bibitem{Stefan06} Gheorghe \c Stefan: ``The CA1024: A Massively Parallel Processor for Cost-Effective HDTV'', in {\em Spring Processor Forum: Power-Efficient Design}, May 15-17, 2006, San Jose, CA, and in {\em Spring Processor Forum Japan}, June 8-9, 2006, Tokyo. \bibitem{Stefan06b} Gheorghe \c Stefan, Anand Sheel, Bogdan M\^{\i}\c tu, Tom Thomson, Dan Tomescu: ``The CA1024: A Fully Programmable System-on-Chip for Cost-Effective HDTV Media Processing'', in {\em Hot Chips: A Symposium on High Performance Chips}, Stanford University, August 20-22, 2006.
\bibitem{Stefan06c} Gheorghe \c Stefan: ``The CA1024: SoC with Integral Parallel Architecture for HDTV Processing'', in {\em 4th International System-on-Chip (SoC) Conference \& Exhibit}, November 1-2, 2006, Newport Beach, CA. \bibitem{Thiebaut06} Dominique Thi\' ebaut, Gheorghe \c Stefan, Mihaela Mali\c ta: ``DNA search and the Connex technology'', in {\em International Multi-Conference on Computing in the Global Information Technology - Challenges for the Next Generation of IT\&C - ICCGI 2006}, Bucharest, Romania, August 1-3, 2006. \bibitem{Thiebaut07} Dominique Thi\' ebaut, Mihaela Mali\c ta: ``Pipelining the Connex array'', {\em BARC07}, Boston, Jan. 2007. \bibitem{Xavier98} C. Xavier, S. S. Iyengar: {\em Introduction to Parallel Algorithms}, John Wiley \& Sons, Inc., 1998. \bibitem{Brainerd74} W. S. Brainerd, L. H. Landweber: {\em Theory of Computation}, Wiley, 1974. \end{thebibliography} \end{document}

## The Draw1_1.lp File

This file was generated by LaTeX-CAD, a package for drawing figures in LaTeX. Other packages are available for generating such figures.

% Drawing generated by LaTeX-CAD 1.9 - requires latexcad.sty % (c) 1998 John Leis leis@usq.edu.au \unitlength 1.8pt {\tiny \begin{picture}(112,68) \Thicklines \drawframebox{24.0}{42.0}{16.0}{12.0}{$h_{0}$} \drawframebox{46.0}{42.0}{16.0}{12.0}{$h_{1}$} \drawframebox{78.0}{42.0}{16.0}{12.0}{$h_{m-1}$} \thicklines \drawdot{58.0}{42.0} \thinlines \drawdot{62.0}{42.0} \drawdot{66.0}{42.0} \Thicklines \drawframebox{52.0}{18.0}{32.0}{12.0}{$g$} \thicklines \drawvector{40.0}{28.0}{4.0}{0}{-1} \drawvector{46.0}{36.0}{12.0}{0}{-1} \drawvector{64.0}{28.0}{4.0}{0}{-1} \drawpath{64.0}{28.0}{78.0}{28.0} \drawpath{78.0}{28.0}{78.0}{36.0} \drawpath{40.0}{28.0}{24.0}{28.0} \drawpath{24.0}{28.0}{24.0}{36.0} \drawdot{50.0}{28.0} \drawvector{52.0}{12.0}{6.0}{0}{-1} \drawpath{24.0}{54.0}{78.0}{54.0} \drawpath{52.0}{60.0}{52.0}{54.0} \drawvector{24.0}{54.0}{6.0}{0}{-1} \drawvector{46.0}{54.0}{6.0}{0}{-1} \drawvector{78.0}{54.0}{6.0}{0}{-1} \thinlines \drawdot{46.0}{54.0} \drawdot{52.0}{54.0} \drawcenteredtext{54.0}{62.0}{$x_{0}, x_{1}, \ldots x_{n-1}$} \drawcenteredtext{64.0}{3.0}{{\tt $out = f(x_{0}, x_{1}, \ldots x_{n-1})$}} \drawdot{54.0}{28.0} \drawdot{58.0}{28.0} \end{picture} } \vspace{-0.5cm}