What kind of processor is a single chip with two or more processor cores?

The terms “core,” “processor,” and “CPU” are easy to confuse because the underlying technology has changed so rapidly, and in many contexts they are used interchangeably. Traditionally, the terms “processor” and “CPU” were straightforward: they referred to a microprocessor chip that carried out a series of processing tasks based on its input, executing them one by one in series. There was no clear definition of exactly which processing functions had to be included in one CPU.

However, when microprocessor clock speeds began to hit the heat barrier (see Figure 1.9), designers and engineers looked for other ways to increase CPU performance and arrived at the multicore architecture, that is, parallel computing. With a multicore architecture, hardware engineers can escape the “heat/performance” dilemma. In other words, “multicore” means multiple smaller CPUs within one large CPU.

How, then, do we define the core boundary? The first commercial CPU, the Intel 4004, consisted of only a few very basic execution blocks: an arithmetic logic unit (ALU), an instruction fetcher and decoder, registers, a pipeline, an interrupt handler, and an I/O control unit (see Figure 11.14).


Figure 11.14. The first commercial CPU: Intel 4004 [186].

Later, cache memory was added to large CPUs, and the basic execution blocks of the processor could be duplicated. These physically self-contained execution blocks, built alongside a shared cache memory, are now called “cores.”

Each core can carry out all computational tasks independently, without interacting with components outside the core, such as the I/O control unit and interrupt handler, which belong to the larger CPU and are shared among all the cores.

In summary, a core is a small CPU or processor built into a larger CPU or CPU socket, and it can independently perform all computational tasks. From this perspective, a core can be considered a smaller CPU or smaller processor within a big processor (see Figure 11.15).


Figure 11.15. Multicore processors.

Some software vendors, such as Oracle, charge license fees on a per-core basis rather than per socket.
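As a hedged illustration of the core/socket distinction, the sketch below reports the logical CPU count exposed by the operating system and, on Linux, derives physical-core and socket counts from /proc/cpuinfo; the parsing logic is an assumption added here for illustration and is not part of the chapter.

```python
import os

# Logical CPUs visible to the OS (cores x hardware threads); portable.
print("logical CPUs:", os.cpu_count())

# Physical cores and sockets -- Linux-specific sketch based on /proc/cpuinfo.
# (Assumed layout of that file; other operating systems expose this differently.)
try:
    sockets, cores = set(), set()
    with open("/proc/cpuinfo") as f:
        phys = None
        for line in f:
            if line.startswith("physical id"):
                phys = line.split(":")[1].strip()
                sockets.add(phys)
            elif line.startswith("core id"):
                cores.add((phys, line.split(":")[1].strip()))
    print("sockets:", len(sockets), "physical cores:", len(cores))
except FileNotFoundError:
    print("/proc/cpuinfo not available on this platform")
```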


URL: https://www.sciencedirect.com/science/article/pii/B9780128014134000118

Dark Silicon and Future On-chip Systems

Pejman Lotfi-Kamran, Hamid Sarbazi-Azad, in Advances in Computers, 2018

3.1 Lack of Parallelism

Moving from single-core processors to multicore processors was mainly motivated by the fact that many applications are inherently parallel and can benefit from execution on multiple cores. While this intuition is generally true, different applications offer different levels of parallelism. The fraction of an application that can be executed in parallel is referred to as its level of parallelism. This model is deliberately simple and assumes that the parallel fraction of an application is infinitely parallelizable. Under this model, the level of parallelism of an application covers the full range from 0% (not parallelizable) to 100% (fully parallelizable).

Clearly, the multicore era offers little benefit to applications with limited parallelism. However, even for applications that are highly parallelizable, the serial fraction (i.e., the fraction that is not parallelizable and must run serially on a single core) ultimately limits the performance that can be obtained through parallel execution. A simple formula, often referred to as Amdahl's law [13], captures the relationship between the level of parallelism, the core count, and the speedup.

(4) S = 1 / ((1 − p) + p / n)

Amdahl's law assumes that an application can be divided into two parts: a serial part and a parallel part. In Eq. (4), p is the fraction of the single-core execution time spent in the parallel part, n is the number of cores, and S is the speedup obtained when the application runs on n cores.
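A minimal sketch of Eq. (4); the function name is introduced here only for illustration, and the printed values reproduce the 1024-core speedups discussed below.

```python
def amdahl_speedup(p, n):
    """Speedup predicted by Amdahl's law, Eq. (4): parallel fraction p, n cores."""
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.9, 0.99, 0.999):
    print(f"p = {p}: speedup on 1024 cores = {amdahl_speedup(p, 1024):.0f}x")
# p = 0.9   -> ~10x
# p = 0.99  -> ~91x
# p = 0.999 -> ~506x
```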

Fig. 1 shows the speedup on various numbers of cores for three applications with levels of parallelism of 0.9, 0.99, and 0.999. The line corresponding to the application with 90% parallelism is essentially flat: its speedup on 1024 cores is just 10×. For an application with 99% parallelism, we observe a higher speedup than for the application with 90% parallelism, yet even then the speedup is just 91× on 1024 cores. The highest speedup is observed for an embarrassingly parallel application with 99.9% parallelism, and it is significantly higher than that of the other two applications. Even so, the speedup is just 506× when this application runs on 1024 cores, which means that most of the cores (1024 − 506 = 518 of them) are effectively not utilized.


Fig. 1. Speedup of three applications with 0.9, 0.99, and 0.999 parallel fraction on various numbers of cores.

Unfortunately, achieving a level of parallelism of 99% is usually difficult. Many parallel applications offer less than 90% parallelism, and making them more parallel is extremely difficult and requires enormous effort. Moreover, Fig. 1 shows that utilizing all the cores of a multicore processor becomes harder and harder as the number of cores increases, even for embarrassingly parallel applications. While there are applications with 100% parallelism (e.g., some data center cloud applications such as web search and media streaming), many parallel applications offer parallelism below 100%. Fig. 1 shows that such applications are inherently incapable of utilizing all the cores of a many-core processor.

When only a few cores are available, the inherent parallelism in applications is sufficient to utilize them. However, as the number of cores increases, even many embarrassingly parallel applications become incapable of utilizing all the cores. This phenomenon leads to the utilization wall, in which applications cannot utilize all the cores effectively. To increase utilization, we would need to rewrite applications to offer higher levels of parallelism; unfortunately, this task is difficult and time-consuming, and sometimes even impossible.


URL: https://www.sciencedirect.com/science/article/pii/S0065245818300147

Dark Silicon and Future On-chip Systems

Mehdi Modarressi, Hamid Sarbazi-Azad, in Advances in Computers, 2018

4 Specialized NoC for Specialized Cores

In a many-core processor in the dark silicon era, with the vast majority of on-chip cores powered off (or dark), it is very likely that the active cores will be noncontiguous and distributed across the network in an irregular way.

If the NoC is designed and customized for a single application (or a group of similar applications), the cores that best match the processing requirements of each task of the application are selected (or generated) first, and a mapping algorithm then performs the core-to-NoC-node mapping. The major objective of most mapping algorithms is to place the cores that communicate most often (and with high volume) close to each other [65]. However, modern general-purpose many-core processors run a large number of increasingly diverse applications (often unknown at design time) with potentially different traffic patterns. In such systems, it is not possible to find a proper core-to-NoC mapping for all applications. For example, Fig. 1 shows the cores of a heterogeneous MPSoC that are activated to run two different applications. Each application activates and runs on those cores that best match its tasks. If the NoC were designed and customized for each individual application, the cores would be mapped onto adjacent nodes. In a general-purpose platform, however, it is very likely that the preferred cores will be nonadjacent. The problem in this case is that the application and its intertask traffic may not be known at design time, when the mapping is performed. Even for target applications that are specified a priori, it is often infeasible to find a mapping that suits all applications, as the intercore traffic pattern varies significantly across applications.


Fig. 1. A heterogeneous CMP with specialized cores. Active cores when playing a game (A) and listening to music (B).

In this case, specialized cores for a particular application domain or class of applications (e.g., multimedia applications) are placed in a contiguous region of the chip area. However, the mapping of cores inside the region may not be optimal for all of the applications in that domain, as each application uses the cores with its own traffic pattern.

Fig. 2 shows a region of a many-core chip that is assigned to the cores required by a multimedia application set [20]. The application set contains an H263 encoder and decoder and an MP3 encoder and decoder. The mapping of cores inside the region is done by the algorithm we presented in a previous work [20], which is designed to map multiple applications with different intercore communication patterns. As Fig. 2 shows for two different applications from this domain, each activates different cores of the region, so topology optimization is still needed for such region-based NoCs.


Fig. 2. A region in a heterogeneous CMP with specialized cores for the MMS applications and active cores of the region when running MP3 decoder (A) and H263 encoder (B) [10]. The communication task graph related to each application is also illustrated.

Our previous study demonstrates that for two applications, x and y, that use the same set of cores but differ by 50% in intercore traffic, running x on an NoC whose mapping is optimized for y increases the communication latency by 30%–55% compared with a mapping customized for x [10].

In a partially active CMP, conventional NoCs still require all packets generated by active cores to go through the router pipeline at every intermediate node (both active and inactive) on a hop-by-hop basis. As a result, many packets may suffer long latencies if the active nodes are topologically far apart.

Due to the dynamic nature of the core utilization pattern in such processors, in which the set of active cores varies over time, a reconfigurable topology is an appropriate option for adapting to changes in network traffic. Among the existing reconfigurable topologies [20–23], we focus on the architecture we proposed in a previous work [20] (RecNoC, hereinafter), as it provides a more appropriate trade-off between flexibility and area overhead than the others. As mentioned before, RecNoC relies on embedding configuration switches into a regular NoC to dynamically change the interrouter connectivity. In this work, we show how the routers of the dark regions of a chip can be used as configuration switches to achieve the same level of power reduction and performance improvement as RecNoC, largely without paying its area overhead. The proposed NoC provides reconfigurable multihop intercore links among active cores by using the routers of dark cores as bypass paths. This enables the NoC to operate in the same way as a customized NoC in which the cores are placed at nearby nodes.

Obviously, the proposed reconfiguration does not change the physical distance between the cores. However, this topology adaptation mechanism reduces communication power and latency significantly by reducing the number of intermediate routers traversed, and router latency and power often dominate the total NoC latency and power. For example, in Intel's 80-core TeraFLOPS processor, routers account for more than 80% of the NoC power consumption, whereas links consume the remaining 20% [66].
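A minimal sketch of why bypassing intermediate routers helps, under an assumed cost model; the per-hop router and link cycle counts below are illustrative assumptions, not figures from this chapter.

```python
# Assumed per-hop costs for illustration only.
ROUTER_CYCLES = 3   # cycles spent in a router pipeline
LINK_CYCLES = 1     # cycles spent traversing a link

def path_latency(hops, bypassed_routers=0):
    """Latency of a packet over `hops` links when `bypassed_routers` of the
    intermediate (dark) routers are configured as simple pass-through switches."""
    intermediate = hops - 1
    routers_traversed = (intermediate - bypassed_routers) + 2  # + source and destination
    return routers_traversed * ROUTER_CYCLES + hops * LINK_CYCLES

# Two active cores six hops apart, with four of the five intermediate routers dark:
print(path_latency(6))      # conventional NoC: every router is traversed
print(path_latency(6, 4))   # reconfigured NoC: dark routers act as bypass paths
```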

Existing NoC topologies range from regular tile-based structures [5,66,67] to fully customized ones [68–70]. Regular NoC architectures provide standard, structured interconnects that offer high reusability and short design time and effort. Customized topologies, on the other hand, are tailored to the traffic characteristics of one or several target applications and therefore offer lower latency and power consumption when running those applications. However, topology customization transforms the regular structure of standard topologies into a nonreusable ad hoc structure with many implementation issues, such as uneven wire lengths, heterogeneous routers, and long design times. Since our proposal realizes application-specific customized topologies on top of structured, regular components, it stands between these two extremes of topology design and gets the best of both worlds: it is designed and fabricated as a regular NoC, but it can be dynamically configured into a topology that best matches the core activation pattern of a partially active multicore processor.

In the next section, we review some related research proposals on NoC power/performance optimization and then present our dark-silicon aware NoC architecture.


URL: https://www.sciencedirect.com/science/article/pii/S0065245818300226

Machine learning

Jim Jeffers, ... Avinash Sodani, in Intel Xeon Phi Processor High Performance Programming (Second Edition), 2016

Threading and Work Partitioning

For a many-core processor like Knights Landing, exposing sufficient parallelism and partitioning work into threads becomes a crucial design task. There are two considerations for threading: (a) the work should be partitioned so that the outputs produced by any two threads are ideally nonoverlapping; (b) where the job cannot be partitioned into threads with nonoverlapping outputs, we privatize the outputs and follow the convolution layer operation with a synchronization and reduction operation.

We first examine the forward-propagation operation. Here the output activation consists of OFM*OFH*OFW neurons per image and a total of MINIBATCH*OFM*OFH*OFW neurons, each of which can be computed independently. Hence we can potentially have MINIBATCH*OFM*OFH*OFW work items (147456 for MINIBATCH = 1), which can easily be partitioned into hundreds of threads. However, vectorization and register blocking require each work item to cover at least SIMD_WIDTH output feature maps and RB_SIZE elements along the width of the feature map. Hence the number of work items is reduced to (MINIBATCH*OFM*OFH*OFW)/(RB_SIZE*SIMD_WIDTH). We can simplify this further by partitioning into work items of size SIMD_WIDTH*OFW instead of RB_SIZE*SIMD_WIDTH. For the forward-propagation we still have a large number of work items (768 for the OverFeat C5 layer with MINIBATCH = 1), which can, for example, be divided into 132 threads with a load balance of 97% ((768/132)/ceiling(768/132)). While the Knights Landing preproduction part we use has 68 cores, we use only 66 of them and leave two cores for the operating system, communication management, and I/O operations. Hence the compute is performed by 132 threads running on 66 cores.
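A small sketch of the work-item and load-balance arithmetic above. The OverFeat C5 dimensions and the SIMD_WIDTH value are assumptions chosen to be consistent with the 147456 and 768 figures quoted in the text:

```python
import math

# Assumed OverFeat C5 dimensions (consistent with the counts quoted above).
MINIBATCH, OFM, OFH, OFW = 1, 1024, 12, 12
SIMD_WIDTH = 16

neurons = MINIBATCH * OFM * OFH * OFW        # 147456 independently computable outputs
work_items = neurons // (SIMD_WIDTH * OFW)   # 768 work items of size SIMD_WIDTH*OFW

def load_balance(items, threads):
    """Average work per thread divided by the maximum work any thread receives."""
    per_thread = items / threads
    return per_thread / math.ceil(per_thread)

print(neurons, work_items)                     # 147456 768
print(f"{load_balance(work_items, 132):.0%}")  # ~97% on 132 threads (66 cores x 2)
```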

A similar work partitioning happens for the back-propagation, and the threading strategy is again identical. The “gradient of inputs” (del_input) is partitioned into (IFM*IFH*IFW)/(IFW*SIMD_WIDTH) work items, which are then distributed among the threads.

For the weight-gradient updates, the number of work items can be considerably smaller when the number of input and output feature maps is low. For the OverFeat C5 layer (recall KH = KW = 3), the register blocking strategy leads to (OFM*IFM*KH*KW)/(KH*KW*SIMD_WIDTH*2), or 32768, work items, which yields good load balance. However, for the C1 layer with OFM = 96, IFM = 3, and KH = KW = 11, we have (OFM*IFM*KH*KW)/(SIMD_WIDTH*KW*IFM) = 66 work items. Fortunately, this divides equally among 66 threads to attain good load balance, but a slight change in topology could lead to up to 30% load imbalance for this layer. In such cases, we need to privatize the “gradient of weights” array and follow it up with a synchronization and reduce operation.
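A hedged sketch of the C1 weight-gradient count above; SIMD_WIDTH = 16 and the load_balance helper are assumptions introduced here for illustration:

```python
import math

SIMD_WIDTH = 16  # assumed value, consistent with the count quoted above

def load_balance(items, threads):
    """Average work per thread divided by the maximum work any thread receives."""
    per_thread = items / threads
    return per_thread / math.ceil(per_thread)

# C1 layer of OverFeat: OFM = 96, IFM = 3, KH = KW = 11 (from the text).
OFM, IFM, KH, KW = 96, 3, 11, 11
c1_items = (OFM * IFM * KH * KW) // (SIMD_WIDTH * KW * IFM)
print(c1_items)                             # 66 work items
print(f"{load_balance(c1_items, 66):.0%}")  # 100%: exactly one item per thread
# With so few work items, any topology change that leaves the count not evenly
# divisible by the thread count forces some threads to process an extra item,
# which is the load-imbalance risk described above.
```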


URL: https://www.sciencedirect.com/science/article/pii/B9780128091944000247

Power-Efficient Network-on-Chips: Design and Evaluation

Fawaz Alazemi, ... Bella Bose, in Advances in Computers, 2022

9 Conclusion

Current and future many-core processors demand highly efficient on-chip networks to connect hundreds or even thousands of processing cores. In this work, we analyze on-chip wiring resources in detail and propose a novel routerless NoC design that removes the costly routers of conventional NoCs while still achieving scalable performance. We also propose an efficient interface hardware implementation and evaluate the proposed scheme extensively. Simulation results show that the proposed routerless NoC design offers significant advantages in latency, throughput, power, and area compared with other designs. These results demonstrate the viability and potential benefits of the routerless approach and call for future work that continues to improve various aspects of routerless NoCs, such as performance, reliability, and power efficiency.


URL: https://www.sciencedirect.com/science/article/pii/S0065245821000760

Embedded Software in Real-Time Signal Processing Systems: Design Technologies 

GERT GOOSSENS, ... MEMBER, IEEE, in Readings in Hardware/Software Co-Design, 2002

A A Paradigm Shift from Hardware to Software

Increasing the amount of software in an embedded system brings several important advantages. First, late specification changes can be included in the design cycle. Second, it becomes easier to differentiate an existing design by adding new features to it. Finally, software facilitates the reuse of previously designed functions, independently of the selected implementation platform; this requires that functions be described at a processor-independent abstraction level (e.g., C code).

There are different types of core processors used in embedded systems.

General-purpose processors. Several vendors of off-the-shelf programmable processors now offer existing processors as core components, available as library elements in their silicon foundries [4]. Both microcontroller cores and digital signal processor (DSP) cores are available. From a system designer's point of view, general-purpose processor cores offer a quick and reliable route to embedded software, which is especially attractive for low- to medium-volume production.

Application-specific instruction-set processors. For high-volume consumer products, many system companies prefer to design an in-house application-specific instruction-set processor (ASIP) [1], [3]. By customizing the core's architecture and instruction set, the system's cost and power dissipation can be reduced significantly, the latter being crucial for portable and network-powered equipment. Furthermore, in-house processors eliminate dependence on external processor vendors.

Parameterizable processors. An intermediate solution between the previous two is provided by traditional and new “fabless” processor vendors [5]–[7], as well as by semiconductor departments within larger system companies [8], [9]. These groups offer processor cores with a given basic architecture that are available in several versions, e.g., with different register file sizes or bus widths, or with optional functional units. Designers can select the instance that best matches their application.


URL: https://www.sciencedirect.com/science/article/pii/B9781558607026500399

Intel® Core™ Processors

In Power and Performance, 2015

3.2.1 Intel® HD Graphics

Prior to the Second Generation Intel Core processor family, Intel® integrated graphics were not part of the processor but were instead located on the motherboard. These graphics solutions were designated Intel® Graphics Media Accelerators (Intel® GMA) and were essentially designed to accommodate users whose graphical needs, such as surfing the Internet or editing documents and spreadsheets, were not intensive enough to justify the expense of dedicated graphics hardware. As a result, Intel GMA hardware has long been considered inadequate for graphically intensive tasks, such as gaming or computer-aided design.

Starting with the Second Generation Intel Core processor family, Intel integrated graphics are now part of the processor, meaning that the GPU is now an uncore resource. While the term “integrated graphics” has typically had a negative connotation for performance, it actually has many practical advantages over discrete graphics hardware. For example, the GPU can now share the Last Level Cache (LLC) with the processor’s cores. Obviously, this has a drastic impact on tasks where the GPU and CPU need to collaborate on data processing.

In many ways, the GPU now acts like another core with regard to power management. For example, starting with the Second Generation Intel Core processor family, the GPU has a sleep state, RC6. Additional GPU sleep states, such as RC6P, were added in the Third and Fourth Generation Intel Core processors. This also means that for the processor to enter a package C state, not only must all the CPU cores enter deep sleep, but so must the GPU. Additionally, the GPU is now part of the processor package's TDP, meaning that the GPU can utilize the power budget of the cores and the cores can utilize the power budget of the GPU. In other words, the GPU affects much of the CPU's power management.


URL: https://www.sciencedirect.com/science/article/pii/B9780128007266000033

Knights Landing overview

Jim Jeffers, ... Avinash Sodani, in Intel Xeon Phi Processor High Performance Programming (Second Edition), 2016

Overview

Knights Landing is a many-core processor that delivers massive thread and data parallelism with high memory bandwidth. It is designed to deliver high performance on parallel workloads. Knights Landing provides many innovations and improvements over the prior-generation Intel Xeon Phi coprocessor, code named Knights Corner. Knights Landing is a standard, standalone Intel Architecture processor that can boot stock operating systems and connect to a network directly via prevalent interconnects such as InfiniBand, Ethernet, or Intel® Omni-Path Fabric (Chapter 5). This is a significant advancement over Knights Corner, which was restricted to operating as a PCIe-connected device and therefore could only be used when attached to a separate host processor. Knights Landing introduces a more modern, power-efficient core that triples (3×) both scalar and vector performance compared with Knights Corner. Knights Landing offers over three TFLOP/s of double-precision and six TFLOP/s of single-precision peak floating point performance. It introduces a new memory architecture utilizing two types of memory (MCDRAM and DDR) to provide both high memory bandwidth and large memory capacity. It is binary compatible with prior Intel® processors, meaning it can run the same binary executable programs that run on current x86 or x86-64 processors without any modification. Knights Landing introduces the 512-bit Advanced Vector Extensions known as Intel® AVX-512. These new 512-bit vector instructions are architecturally consistent with the previously introduced 256-bit AVX and AVX2 vector instructions, and AVX-512 will be supported in future Intel® Xeon® processors as well. Finally, Knights Landing introduces optional on-package integration of the Intel® Omni-Path Architecture, enabling direct network connection with improved efficiency compared to an external fabric chip or card.


URL: https://www.sciencedirect.com/science/article/pii/B9780128091944000028

The Process View

Richard John Anthony, in Systems Programming, 2016

2.12.1 Questions

1. Consider a single-core processor system with three active processes {A, B, C}. For each of the process state configurations a–f below, identify whether the combination can occur or not. Justify your answers.

Process   a        b        c        d        e        f
A         Running  Ready    Running  Blocked  Ready    Blocked
B         Ready    Blocked  Running  Ready    Ready    Blocked
C         Ready    Ready    Ready    Running  Ready    Blocked

2. For a given process, which of the following state sequences (a–e) are possible? Justify your answers.

(a) Ready → Running → Blocked → Ready → Running → Blocked → Running
(b) Ready → Running → Ready → Running → Blocked → Ready → Running
(c) Ready → Running → Blocked → Ready → Blocked → Ready → Running
(d) Ready → Blocked → Ready → Running → Blocked → Running → Blocked
(e) Ready → Running → Ready → Running → Blocked → Running → Ready

3. Consider a compute-intensive task that takes 100 s to run when no other compute-intensive tasks are present. Calculate how long four such tasks would take if started at the same time (state any assumptions).

4. A non-real-time compute-intensive task is run three times in a system, at different times of the day. The run times are 60, 61, and 80 s, respectively. Given that the task performs exactly the same computation each time it runs, how do you account for the different run times?

5. Consider the scheduling behavior investigated in Activity P7.

(a) What can be done to prevent real-time tasks from having their run-time behavior affected by background workloads?
(b) What does a real-time scheduler need to know about processes that a general-purpose scheduler does not?

What name is used for a processor with 2 cores?

A processor with two cores is called a dual-core processor; with four cores, a quad-core; six cores, hexa-core; eight cores, octa-core.

What is meant by 2 cores in a processor?

A dual-core CPU has two distinct processors that work simultaneously in the same integrated circuit. This type of processor can function as efficiently as a single processor but can perform operations up to twice as quickly.

Which of the following is a single chip with two or more separate processors?

A multi-core processor is a single chip with two or more separate processor cores. Multi-core processors are used in all sizes of computers. Processors contain a control unit and an arithmetic logic unit (ALU).

What are the 2 types of processors?

There are two main types of processor:

RISC (reduced instruction set computer) processors are used in modern smartphones and tablets. This type of processor can carry out simple instructions quickly.

CISC (complex instruction set computer) processors are used in desktop and laptop computers.