# **ORIGINAL RESEARCH**



# **A 1.0 fJ energy/bit single‐ended 1 kb 6T SRAM implemented using 40 nm CMOS process**



1 Department of Electrical Engineering, National Sun Yat‐Sen University, Kaohsiung, Taiwan

2 Institute of Undersea Technology, National Sun Yat‐Sen University, Kaohsiung, Taiwan

3 Department of Photonics, National Sun Yat‐Sen University, Kaohsiung, Taiwan

4 Department of Engineering Science, National Cheng Kung University, Tainan, Taiwan

5 Department of Electronics Engineering, Batangas State University‐ The National Engineering University, Batangas City, Philippines

#### **Correspondence**

Chua‐Chin Wang, Department of Electrical Engineering, National Sun Yat‐Sen University, No. 70, Lian‐Hai Rd., Kaohsiung City 80424, Taiwan. Email: [ccwang@ee.nsysu.edu.tw](mailto:ccwang@ee.nsysu.edu.tw)

Yu‐Cheng Lin, Department of Engineering Science, National Cheng Kung University, No. 1, University Rd., Tainan City 70101, Taiwan. Email: [yuclin@mail.ncku.edu.tw](mailto:yuclin@mail.ncku.edu.tw)

#### **Funding information**

The National Science and Technology Council of Taiwan, Grant/Award Numbers: MOST109-2218-E-110-007, 108-2218-E-110-011, 108-2218-E-110-002, 107-2218-E-110-002, 110-2221-E-110-063-MY2; National Applied Research Laboratories

#### **Abstract**

An ultra-low-energy SRAM composed of single-ended cells is demonstrated on silicon in this investigation. More specifically, the supply voltages of cells are gated by wordline (WL) enable, and the voltage mode select (VMS) signals select one of the corresponding supply voltages. A lower voltage is selected to maintain the stored bit state when cells are not accessed, lowering the standby power. When a cell is selected (i.e. WL is enabled) to perform read or write (R/W) operations, the normal supply voltage is used. A 1-kb SRAM prototype based on the single-ended cells with built-in self-test (BIST) and power-delay product (PDP) reduction circuits was realised on silicon using 40-nm CMOS technology. Theoretical derivations and simulations of all-PVT-corner variations are also disclosed to justify the low-energy performance. Physical measurements of six prototypes on silicon show that the energy per bit is 1.0 fJ at a 10 MHz system clock.

#### **KEYWORDS**

digital integrated circuits, logic design, low‐power electronics, memory architecture, VLSI

# **1** | **INTRODUCTION**

Memory devices are known to be second only to CPU/MPU in terms of overall timing parameter performance of digital subsystems in electronic products. They will soon occupy 90% of the entire area in an SOC (system on chip) according to the ITRS report [\[1\]](#page-11-0). Low-power memory devices will cut the total power consumption of these products, especially those that are battery powered and portable. Unlike DRAM, which is widely used as the main memory, SRAM has been used in most CPU/MPUs as cache to speed up access to recently used data. The performance of SRAM, regardless of its usage, has a significant influence on power dissipation, which affects the overall efficiency of the system. Thus, it is critical that SRAMs used in CPU/MPUs have an energy-saving feature to minimise their effect on system performance. A variety of SRAM designs have been reported in past decades. Three key design methods have been presented specifically for the energy-saving and power demands of SRAMs:

1). Current-mode sense amplification [[2,](#page-11-0) 3]: Since CMOS technology has been scaled down very quickly, the bitline capacitance has become too large for an SRAM cell to drive. During read operations, a sense amplifier predetermines the output result by sensing the differential current on two bitlines, enabling high-speed and low-power operation. In this case, the bitline capacitance has less of an impact on the output delay.

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.

<sup>©</sup> 2023 The Authors. *IET Circuits, Devices & Systems* published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.

- <span id="page-1-0"></span>2). Secondary supply [[4\]](#page-11-0): Adding an extra supply voltage, higher than the nominal VDD, can improve the access speed of the SRAM at the expense of a higher energy cost. The energy or standby power of the cells that are not accessed is often ignored.
- 3). Current compensation [\[5,](#page-11-0) 6]: When the SRAM is turned on, leakage current is detected by a current compensation circuit in each bitline, which injects an additional current into the associated bitline. Thus, the SRAM's access speed is enhanced, even if the leakage current is not reduced. This method has little energy-saving benefit, especially in standby cells.

A 4T loadless SRAM has been reported for lower power usage, where high-threshold voltage transistors are used in data latches and low-threshold voltage transistors are used in bitline drivers [[7\]](#page-11-0), also called a P-latch N-drive 4T SRAM cell. Despite having self-refreshing paths to keep the stored bit state, instability and read/write disturbance become a threat to the access operations. These threats are particularly strong when loadless designs lack any bitline isolation mechanism. Worse still, the weakening of the static noise margin (SNM), as mentioned in the work of Wang *et al.* [\[8\]](#page-11-0), proved to be a hazard for such a cell structure. These SRAMs were found to be even more vulnerable when the supply voltage is lower. To resolve this issue, readout assist circuits were proposed for single-ended SRAMs [\[9\]](#page-11-0). This breakthrough accelerated research into non-symmetrical R/W auxiliary circuit designs that provide disturbance isolation from bitlines, for example, the report of Chen *et al.* [\[10\]](#page-11-0). The disturbance isolation design becomes even more critical if the SRAM is to be fabricated using advanced CMOS technology nodes (e.g. <100 nm), or if the SRAM cell operates near the subthreshold region. SRAMs with write-assist loops are typical examples that demonstrate the disturbance-free feature [[10,](#page-11-0) 11]. The two examples, however, used the design methodology for symmetrical R/W, and hence cannot be applied to single-ended SRAM cells. Other methods, such as the use of asymmetric Schmitt-trigger inverters, can also improve cell performance [\[12\]](#page-11-0).

In order to meet the low power dissipation demands for SRAMs implemented in advanced CMOS process, a supply voltage gate‐control mechanism for every column of SRAMs is proposed in this work, wherein two supplies with different voltages are used and selected by WL (wordline) and associated signals to decrease the power dissipation while on standby. To compensate for the loss of R/W speed caused by reduced supply voltage, a voltage boost is given to the driving gate of the selected SRAM cells to provide speed and slew rate improvement. Detailed post‐layout simulations and physical on‐silicon measurements are demonstrated to justify the low power/energy feature. The proposed SRAM is fabricated using a typical 40‐nm CMOS process, where 1.0 fJ energy/bit is measured at a 10 MHz system clock with an access delay of 52 ns.

# **2** | **LOW ENERGY SRAM WITH SINGLE‐ ENDED CELLS**

Referring to Figure 1, the proposed SRAM design consists of a memory array, a control circuit, a row and column decoder, a column select circuit, a built-in self-test (BIST) circuit, a power-delay product (PDP) reduction circuit, and a V*DD* select circuit. The supply voltage of the SRAM cells is selected by the V*DD* select circuit. The pass-transistor gate voltage boosting (PVB) and adaptive voltage detector (AVD) circuits make up the PDP reduction circuit. The functions of the major signals in the proposed SRAM are summarised as follows:

- 1. Bit\_Addr[4:0]: bitline addresses
- 2. Word\_Addr[4:0]: wordline addresses
- 3. WR\_EN: write/read enable (1/0)
- 4. Data\_out, Data\_in: data output and data input, respectively
- 5. CLK: system clock
- 6. BS: boost select



**FIGURE 1** 1‐kb SRAM with single‐ended cells system block diagram

- <span id="page-2-0"></span>7. VMS: voltage mode select
- 8. BIST\_Pass: BIST pass (or not)
- 9. BIST\_EN: BIST enable

### **2.1** | **Single‐ended SRAM cell circuit analysis**

Leakage reduction has been introduced in 8T SRAM cells through the use of low-$V_{th}$ auto-gating transistors, which minimise leakage at the expense of lower speed [\[13\]](#page-11-0). Meanwhile, Chen *et al.* [[10](#page-11-0)] reported a 5T single-ended cell that introduced a cell isolation mechanism to prevent noise interference. However, leakage current was a main issue, causing retention faults, particularly for advanced CMOS processes. This issue was not present in the work of Terada *et al.* [\[14\]](#page-11-0), because they introduced a transistor to act as a leakage bypass, at the expense of a larger area overhead.

All the SRAM cells discussed before still exhibit high standby power, since all of the idle cells are directly coupled to regular power, thus consuming a significant amount of standby power. A new cell‐column structure was presented in our previous report [[15\]](#page-12-0), as presented in Figure 2. This was to reduce the standby power, and in turn the overall power dissipation, when most cells are not accessed. Though the cell with the associated power-gated mechanism was described in the mentioned article, operation details, theoretical analysis, and the on‐silicon verification were never disclosed. The proposed design operates as follows:

1). In the event that any cell is being accessed, WLB goes low and WL goes high. Transistor  $M_{301}$  is then turned on, providing the regular  $V_{DD}$  to the cells in the same column.



**FIGURE 2** Proposed ultra‐low power 6T SRAM cell with power gating

2). If none of the cells in the column is being accessed, WLB goes high and WL goes low. A lower supply voltage, $V_{DD} - V_{thp(M306)}$, is coupled to the supply nodes of all cells in the same column. That is, the voltage supplied to the cells is decreased by the threshold voltage of transistor $M_{306}$ to save power while still maintaining the stored bit states.

Let us consider a typical CMOS process for our SRAM design vehicle. The supply voltage V*DD* is 0.8 V for the typical 40‐nm CMOS process such that the reduced supply voltage for those unaccessed cells becomes  $V_{DD}$ – $V_{thp}$  = 0.68 V. Referring to Figure 2, the current supplied through the low‐Vth PMOS devices is limited by the width thereof. Thus, auxiliary circuits driven by a VMS signal are needed to prevent possible R/W errors caused by the insufficient supply current.

- Access operation: WL = 1 and WLB = 0. VMS1 = VMS (= 0) or WLB = 0 such that $M_{305}$ is turned on to supply extra current.
- Hold operation: WL = 0 and WLB = 1. VMS2 = VMS (= 0) or WL = 0 such that $M_{310}$ and $M_{315}$ are on to supply extra current.
- No auxiliary: VMS = 1. All of the auxiliary circuits are off. This is for the purpose of testing to validate the proposed power-gated mechanism.
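The three modes above can be summarised as a small truth-table sketch. It assumes, as one reading of the text rather than a statement of the actual netlist, that VMS1 = VMS OR WLB and VMS2 = VMS OR WL, and that each auxiliary PMOS device conducts when its gate signal is low. The function name `auxiliary_state` is hypothetical.

```python
def auxiliary_state(vms: int, wl: int) -> dict:
    """Return which auxiliary PMOS groups conduct for given VMS and WL.

    Assumptions (not confirmed by the netlist): VMS1 = VMS OR WLB drives
    M305, VMS2 = VMS OR WL drives M310 and M315, and each PMOS device
    conducts when its gate signal is 0.
    """
    wlb = 1 - wl          # WLB is the complement of WL
    vms1 = vms | wlb      # gate signal assumed for M305
    vms2 = vms | wl       # gate signal assumed for M310 and M315
    return {
        "M305_on": vms1 == 0,        # extra current during access
        "M310_M315_on": vms2 == 0,   # extra current during hold
    }


for vms in (0, 1):
    for wl in (0, 1):
        print(f"VMS={vms} WL={wl} -> {auxiliary_state(vms, wl)}")
```

Under these assumptions, setting VMS = 1 disables both auxiliary groups regardless of WL, which corresponds to the test mode described in the last bullet.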

## **2.2** | **SRAM cell transistor sizing**

The transistor sizes of the SRAM cells are determined by the current that passes through the transistors in each operation. To attain reliable $Q$ and $Q_b$ with a symmetric feature, the currents through $M_{201}$ and $M_{202}$ must be the same. Hence, $(W/L)_{201} = (W/L)_{202}$. Transistor $M_{206}$ drains the current when $Q = 0$, so $(W/L)_{206}$ is chosen to be the minimum size so that the lowest current passes through it. The write-assist transistor $M_{203}$ should be able to draw the same current that passes through $M_{201}$ when writing logic '1'. Transistors $M_{203}$ and $M_{204}$ are equally sized so that the ratio of $M_{201}$ to $M_{203}$ equals that of $M_{202}$ to $M_{204}$. Access transistor $M_{205}$ is sized equal to $M_{203}$ and $M_{204}$, since the same current passes through them during a write operation.

To further decrease power consumption, transistors inside the cells are chosen to have high‐V*th*, while the write‐assist and access transistors are chosen to have low‐V*th* to have faster access and write operations.

#### **2.3** | **Analysis of power dissipation and area overhead**

Every column of memory cells in the proposed design has a power-gated mechanism. The biggest challenge is to size the gating transistors so that they retain the correct operating region while still reducing power dissipation. The FS corner (fast NMOS, slow PMOS) can be problematic for this power-gated mechanism, since slow PMOS devices provide weak current. As a result, it is hard for $Q_b$ to stay high (when $Q = 0$). This current shortage can be overcome by solving analytically for the minimum PMOS size, given the number of cells. Assume that every column has a total of $n$ cells, that $I_{act}$ is the current needed by a cell being accessed, and that $I_{idl}$ is the current required by an idle cell. The drain current $I_D$ of the power PMOS device should satisfy the following equation:

$$
I_D \ge I_{act} + (n - 1) \times I_{idl} \tag{1}
$$

If  $n = 32$  and the implementation is on typical 40-nm CMOS technology, the total required current for cells being accessed in a column is  $I_D = 39.5 \mu A = 31.5 \mu A$  (1 cell accessed)  $+31 \times 245$  nA (other 31 cells idle). According to the saturation current  $I_D$  equation for MOS transistors, the minimal width of the power PMOS to supply such a current is 3.75  $\mu$ m. Therefore, a total of 5×750-nm PMOS transistors in parallel are used, which are  $M_{301}-M_{305}$ . Similarly, the number and the size of those power‐gated transistors for idle cells in Figure [2](#page-2-0), namely  $M_{306} - M_{315}$  can be determined as well.
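The sizing arithmetic can be sketched as follows. The snippet evaluates the right-hand side of Eq. (1) from the rounded per-cell currents quoted above, so its result may differ slightly from the 39.5 µA figure, and it counts how many 750-nm devices make up the 3.75-µm total width. The function names are hypothetical.

```python
def column_current_budget(n_cells: int, i_act: float, i_idl: float) -> float:
    """Right-hand side of Eq. (1): one accessed cell plus (n - 1) idle cells."""
    return i_act + (n_cells - 1) * i_idl


def parallel_devices(total_width_nm: int, device_width_nm: int) -> int:
    """Number of equal-width PMOS devices needed to reach the total width."""
    return -(-total_width_nm // device_width_nm)   # integer ceiling division


# Values quoted in the text: n = 32, 31.5 uA per accessed cell, 245 nA per idle cell
budget = column_current_budget(32, 31.5e-6, 245e-9)
devices = parallel_devices(3750, 750)
print(f"I_D must be at least {budget * 1e6:.2f} uA; parallel devices: {devices}")
```

Working in integer nanometres for the widths avoids floating-point rounding when taking the ceiling.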

Based on the proposed power-gated mechanism to decrease the standby power for most of the cells, the layouts of the proposed ultra-low power 6T SRAM cell and the power-gated mechanism circuit are shown in Figures 3 and 4, respectively, where the area of a single cell is $1.8 \times 2.1\ \mu\text{m}^2$, and that of the power-gated mechanism circuit is $2.8 \times 3.6\ \mu\text{m}^2$. When every 32 cells share one power-gated mechanism circuit, the area overhead is $\frac{2.8 \times 3.6}{32 \times 1.8 \times 2.1} = 8.33\%$, if the area of wiring is not taken into account. If the length of the cell array is increased to 1024, the area penalty becomes only 0.26%, which is negligible.
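A short sketch of the overhead calculation, using only the cell and gating-circuit dimensions given above and ignoring wiring area as the text does. The function name is hypothetical.

```python
def gating_area_overhead(n_cells: int,
                         cell_area_um2: float = 1.8 * 2.1,
                         gate_area_um2: float = 2.8 * 3.6) -> float:
    """Area of one power-gated circuit relative to the n cells sharing it."""
    return gate_area_um2 / (n_cells * cell_area_um2)


for n in (32, 1024):
    print(f"n = {n:4d}: overhead = {gating_area_overhead(n) * 100:.2f}%")
```

The overhead scales as 1/n, which is why sharing one gating circuit across a longer column reduces the penalty.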

Figure 5 compares the layout size of the proposed cell with that of a conventional 6T cell. Both cells have similar transistor sizes, and the proposed cell occupies a 21% larger area than the conventional cell layout, since the proposed cell uses HVT and LVT devices such that the shared diffusion layout (as is done for the conventional cell) cannot be utilised to minimise the area of the cell.

**FIGURE 3** Layout of the proposed ultra-low-power 6T SRAM cell

The proposed ultra-low-energy 6T single-ended SRAM cell design with the power-gated mechanism does not need a sense amplifier (SA), like prior single-ended SRAMs [\[10,](#page-11-0) 14]. In contrast, traditional SRAMs [\[13\]](#page-11-0) need SAs to accelerate access operations, because they use the bitline and the complementary bitline together to access the data nodes. This is another reason why the proposed SRAM is more energy efficient.

#### **2.4** | **Read/write cycles**

Read/write cycles of the proposed SRAM are described as follows. The read cycles are shown in Figure [6](#page-4-0). The read operation in the cell is shown in Figure [7.](#page-4-0)



**FIGURE 4** Layout of the proposed power-gated mechanism

**FIGURE 5** Layout size comparison (a) proposed 6T cell; (b) conventional 6T cell

<span id="page-4-0"></span>

**FIGURE 6** Read cycle timing diagram



**FIGURE 7** Read operation

- Before any operation, predischarge grounds BLB to prevent state "0" from being disrupted by leakage and noise.
- The matching decoders choose the cell once the row and column addresses are available.
- WA and WL are then pulled high, turning $M_{204}$ ($M_{203}$) and $M_{205}$ on. Regardless of Read1 or Read0, WAB is pulled low to turn $M_{203}$ off. Qb will then be coupled to BLB through $M_{205}$ and $M_{204}$.

The write operation timing diagram is shown in Figure 8. The write-0 and write-1 operations are shown in Figures 9 and 10, respectively.

- Write-0: WA is set to low and WAB is then pulled high to turn transistors $M_{203}$ and $M_{204}$ on and off, respectively. Q is then pulled down to the ground using the predischarge signal.



**FIGURE 8** Write cycle timing diagram

**FIGURE 9** Write-0 operation

**FIGURE 10** Write-1 operation

- Write-1: WA is pulled high to turn transistor $M_{204}$ on. WAB is low to turn transistor $M_{203}$ off. Predischarge then pulls Q high by pulling Qb down.

Table 1 tabulates the overall R/W operation and related control signals of the proposed SRAM.

## **2.5** | **Hold/standby operation**

The hold/standby operation is also shown as part of the read cycle in Figure [6.](#page-4-0) All the access transistors ($M_{203}$, $M_{204}$, and $M_{205}$) are turned off to isolate the memory cell. During this operation, the power gating circuit is also enabled to reduce the supply voltage of the inactive cells, as shown in Figure 11. A reduced voltage of around 0.68 V is then used to supply the cells, which ensures low-power standby operation.

Since the low‐Vth access transistors are used in the design, it is important that these transistors do not leak in the worst process corner. A long transient simulation during hold operation is presented in Section [3](#page-6-0) to show that the design will not droop during the worst-case corner.

#### **2.6** | **PDP reduction circuit**

Aside from the power-gating method used in the preceding sections for each column of cells, a power-delay product (PDP, namely "energy") reduction circuit is used to further minimise energy consumption in each R/W operation [\[11,](#page-11-0) 15, 16]. Referring to Figure 12, the adaptive voltage detector and pass-transistor gate voltage boosting are the two sub-circuits of the PDP reduction circuit [\[11\]](#page-11-0). For high-speed access operations, when boost select (BS) is set high, the AVD circuit asserts a boost enable (Boost\_EN). This changes the voltage supply of the cells to be accessed from $V_{DD}$ to $V'_{DD}$ (a voltage greater than $V_{DD}$).

**TABLE 1** Read and write operation

|              | Write 1 | Write 0 | Read (1/0) | Standby |
|--------------|---------|---------|------------|---------|
| Predischarge |         |         | 0          |         |
| WL           | 1       |         | 1          | 0       |
| WA           |         | 0       | 1          | 0       |
| WAB          | 0       |         | 0          | 0       |
| BL           | 1       | 1       | 1/0        |         |
| BLB          | 0       | 0       | 0/1        | 0       |

**FIGURE 11** Standby/hold operation

#### 2.6.1 | Adaptive voltage detector (AVD)

Figure 13 shows the adaptive voltage detector circuit, which generates the boost enable signal for the pass-transistor voltage boosting circuit. Once the BS signal is enabled, a common-source amplifier (composed of $M_{1301}$, $M_{1302}$, and $R_{1301}$) generates the VP0 signal, which is then fed into the current-starved inverter composed of $M_{1304}$ and $M_{1305}$; this inverter has a precise switching voltage so that it can respond to slight variations in the BS signal. The inverter's output is then latched to keep track of whether the pass-transistor gate boosting circuit has to be enabled or disabled. The Boost\_EN signal will be high if the output of inv$_{1303}$ and the latched voltage VP2 are both low.

#### 2.6.2 | Pass-transistor gate voltage boosting (PVB)

<span id="page-6-0"></span>The PDP reduction circuit initially operates in the waiting mode, and stays there while the AVD circuit has not yet finished the system voltage detection (the inverter switching voltage vs. VP0). The PDP reduction circuit enters the PDP reduction mode after exiting the waiting mode. The operation is as follows, as seen in Figure 14: the top plate of $C_{1401}$ is pulled down to ground by inv$_{1403}$, while the bottom plate is pulled up to $V_{DD}$ by $M_{1401}$. As soon as Boost\_EN is pulled high, the PVB circuit starts to work. If WR\_EN is high, meaning one of the SRAM cells is being accessed (read or write), the PVB circuit enters the voltage boosting mode, turning $M_{1401}$ off. Then the top plate of $C_{1401}$ is pulled to a higher voltage through the pull-up circuit of inv$_{1403}$. Now, the supply voltage $V'_{DD}$ is at a higher level than in the waiting mode, where $V'_{DD} = V_{DD} - V_{DS1401}$. It now has the value $V'_{DD} = V_{DD} + \Delta V$, which in this 40-nm CMOS process is 1.0 V. The illustrative timing diagram of the PVB circuit is presented in Figure 15.

**FIGURE 12** PDP reduction circuit

**FIGURE 13** AVD circuit

#### **2.7** | **Built‐in self‐test (BIST)**

**FIGURE 15** Gate drive boosting timing diagram

In every memory system, a BIST circuit, as shown in Figure [1](#page-1-0), is essential for high reliability. The BIST circuit block diagram is shown in Figure 16. It is composed of a pattern generator, a controller, and an output response analyser. The BIST circuit implements the control and output response pattern based on the March C− algorithm [\[17](#page-12-0)], which has moderate complexity and fault coverage. It was tested for transition faults, address-decoder faults, stuck-at faults, coupling faults, data retention faults, etc. The March C− algorithm, which is the most reliable of BIST circuits, has a complexity of $10 \times N$, where $N$ specifies the memory size, which in this study is 1024. The March C− algorithm, as shown in Eqn. (2), is outlined as follows:

$$
\{\Updownarrow (w0); \quad \Uparrow (r0, w1); \quad \Uparrow (r1, w0); \quad \Downarrow (r0, w1); \quad \Downarrow (r1, w0); \quad \Updownarrow (r0)\} \tag{2}
$$

where *w* represents a write access operation, *r* represents a read operation, $\Updownarrow$ represents either up or down counting, $\Downarrow$ represents counting down, and $\Uparrow$ represents counting up. The timing diagram for the BIST is shown in Figure 17; it has two testing modes, a normal testing mode and a retention testing mode. A linear feedback shift register (LFSR) pseudo-random number generator is used to generate the test patterns with a characteristic equation as follows:
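The six march elements of Eq. (2) can be sketched in software as below. This is a behavioural illustration only, not the on-chip BIST controller: the `Memory` class, its `stuck_at` fault model, and the function name `march_c_minus` are hypothetical, and the two "either direction" elements are arbitrarily run in ascending order.

```python
class Memory:
    """Bit-addressable memory model with optional stuck-at faults."""

    def __init__(self, size, stuck_at=None):
        self.size = size
        self.cells = [0] * size
        self.stuck_at = stuck_at or {}   # {address: forced bit value}

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])

    def write(self, addr, bit):
        self.cells[addr] = bit


def march_c_minus(mem):
    """Apply the six elements of Eq. (2); return (passed, operation_count)."""
    n = mem.size
    up, down = range(n), range(n - 1, -1, -1)
    # Each element: (address order, expected read value, value to write)
    elements = [
        (up, None, 0),     # either direction: w0
        (up, 0, 1),        # up:   r0, w1
        (up, 1, 0),        # up:   r1, w0
        (down, 0, 1),      # down: r0, w1
        (down, 1, 0),      # down: r1, w0
        (up, 0, None),     # either direction: r0
    ]
    passed, ops = True, 0
    for order, expected, value in elements:
        for addr in order:
            if expected is not None:
                ops += 1
                if mem.read(addr) != expected:
                    passed = False
            if value is not None:
                ops += 1
                mem.write(addr, value)
    return passed, ops


print(march_c_minus(Memory(1024)))                    # fault-free memory
print(march_c_minus(Memory(1024, stuck_at={5: 1})))   # cell 5 stuck at 1
```

Counting one operation per read or write gives 1 + 2 + 2 + 2 + 2 + 1 = 10 operations per cell, which is where the $10 \times N$ complexity comes from.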

$$
f(x) = x^5 + x^4 + x^3 + x + 1 \tag{3}
$$
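A software sketch of a 5-bit LFSR with the characteristic polynomial of Eq. (3) is given below. It assumes a Fibonacci (external-XOR) arrangement and an arbitrary non-zero seed; the register structure and seed used on the chip are not stated in the text, and the function names are hypothetical.

```python
def lfsr5_step(state: int) -> int:
    """Advance a 5-bit Fibonacci LFSR for f(x) = x^5 + x^4 + x^3 + x + 1.

    Bit i of `state` holds a[n+i]; the recurrence implied by f(x) is
    a[n+5] = a[n+4] ^ a[n+3] ^ a[n+1] ^ a[n].
    """
    new_bit = ((state >> 4) ^ (state >> 3) ^ (state >> 1) ^ state) & 1
    return (state >> 1) | (new_bit << 4)


def lfsr5_states(seed: int = 0b00001) -> list:
    """Collect the states visited until the register returns to the seed."""
    if not 0 < seed < 32:
        raise ValueError("seed must be a non-zero 5-bit value")
    states, state = [], seed
    while True:
        states.append(state)
        state = lfsr5_step(state)
        if state == seed:
            return states


states = lfsr5_states()
print(len(states), [f"{s:05b}" for s in states[:4]])
```

With this polynomial the register cycles through all 31 non-zero 5-bit states before repeating, so every non-zero 5-bit pattern is generated once per period.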

# **3** | **CHIP IMPLEMENTATION AND MEASUREMENT**

The prototype SRAM was implemented using a TSMC 40-nm CMOS process. Figure [18](#page-7-0) shows the layout of the entire system, with the floor-plan at the right-hand side, where the overall chip area is 0.276 mm<sup>2</sup> and the core area is 0.031 mm<sup>2</sup>. The validation of the design is carried out in the following sections: post-layout simulations and on-silicon measurements.

**FIGURE 17** BIST timing diagram

<span id="page-7-0"></span>

**FIGURE 18** Layout of the proposed 1-kb SRAM

## **3.1** | **All‐PVT‐corner post‐layout simulations**

Post-layout simulation over all process, voltage, and temperature (PVT) variations is required before the system is fabricated on silicon. The corners used in the post-layout simulations are as follows: 5 process corners (TT, SS, SF, FS, and FF), 3 supply voltages ($0.9 \times V_{DD}$, $V_{DD}$, and $1.1 \times V_{DD}$), and 3 temperatures (0, 25, and 75 °C). Monte-Carlo simulations were executed a total of 100 times. All tests showed the correct functionality of the system. It was also confirmed that access is faster when the AVD circuit is activated.
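The corner set described above can be enumerated with a short sketch. It assumes the 0.8 V nominal supply quoted earlier for this process; the variable names are hypothetical and no simulator is invoked.

```python
from itertools import product

PROCESS = ("TT", "SS", "SF", "FS", "FF")
VDD_NOMINAL = 0.8                 # V, nominal supply quoted for this process
VDD_SCALE = (0.9, 1.0, 1.1)
TEMPERATURE_C = (0, 25, 75)

# One (process, supply voltage, temperature) tuple per simulation corner
corners = [
    (process, round(scale * VDD_NOMINAL, 2), temp)
    for process, scale, temp in product(PROCESS, VDD_SCALE, TEMPERATURE_C)
]

print(len(corners))
print(corners[0], corners[-1])
```

The full cross product gives 5 × 3 × 3 = 45 corners, one of which is the SS, 0.72 V, 0 °C case.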

Two of the most important measures for the quality of SRAM operations are the static and dynamic noise margins. The static noise margin of the proposed SRAM cell is assessed by turning off the write-assist loop and turning on  $M_{604}$ ,  $M_{605}$ in Figure [2](#page-2-0). Then, the voltage on BLB is changed from logic "0" (GND) to logic "1" (VDD). The curves of the induced voltages on Q versus Qb are plotted, where the biggest square in the generated transfer functions diagram is the SNM of this cell. Figure 19 shows the static noise margin of the cells showing the worst case of 412.3 mV. The static noise margin has an asymmetrical shape, different from traditional differential SRAM cells, since the design employed single-ended architecture.

The dynamic noise margin (DNM) is assessed by applying pulses with varying amplitude and pulse width at node WL to find out whether the state of node Q is compromised. The dynamic noise margin of the designed SRAM cell is shown in Figure 20, which shows that VDD for the proposed SRAM cell to operate correctly can be as low as 0.3 V.

Figures 21 and [22](#page-8-0) show all the PVT (process, voltage, and temperature) corners simulations when the AVD is disabled and enabled. It can be seen that when the AVD circuit is enabled, the access speed of the SRAM has been improved.

Post-layout simulation shows that the standby power goes down from 17.734 to 3.432 $\mu$W, an 80.65% reduction.
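As a quick check, the relative reduction follows directly from the two simulated standby powers:

$$
\frac{17.734 - 3.432}{17.734} = \frac{14.302}{17.734} \approx 0.8065 = 80.65\%
$$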



**FIGURE 19** Static noise margin







**FIGURE 21** Simulations with AVD circuit disabled

<span id="page-8-0"></span>

**FIGURE 22** Simulations with AVD circuit enabled

**FIGURE 23** Hold state transient simulation in the worst corner (SS corner, $V_{DD} = 0.72$ V, T = 0 °C)

Referring to Figure 23, the post-layout simulation of the proposed design in the worst corner (SS corner, VDD  $= 0.72$  V, and  $T = 0$  °C) is presented. A data bit is written in the cell first, then the data is read, and afterwards, the cell is placed on its hold state. The simulation run shown is 800 ns with the cell in the hold state for over 710 ns. It can be seen that the data bit written in Qb remains at the same value after a long standby state; hence there is no droop in its content. It also shows the lowered voltage level for the cell during standby operation.

#### **3.2** | **On‐chip measurements**

The die photo of the proposed SRAM array (1 kb) is shown in Figure 24. The details are hard to observe because the prototype is covered by metal layers due to the minimum metal density rules required for the 40-nm CMOS process. A total of 6 chips, measured 50 times each, were tested at the measurement site shown in Figure [25,](#page-9-0) which is in the Tainan branch of TSRI (Taiwan Semiconductor Research Institute). The power supply is an Agilent N6761A, an Agilent 81250 pattern generator is used as the test vector generator, and the voltage measurements were taken using an Agilent 54855A oscilloscope. The measurements show an improvement in the read delay of the system when boost select is enabled.

**FIGURE 24** Die photo of our SRAM prototype

For a system clock of 2 MHz, the read delay improves from 229 to 64 ns. Figure 26a,b show the read/write timing waveforms when BS = 0 and BS = 1, respectively, at a system clock of 2 MHz.

Upon increasing the system clock to 10 MHz, the read delay is reduced from 148 to 61 ns. Figure 27a,b show the read/write timing waveforms when BS = 0 and BS = 1, respectively, at a system clock of 10 MHz. The proposed AVD produced a large reduction in the access delay, as expected. The measured write-0 delay is 0.250 ns, and the write-1 delay is 0.233 ns. The overall power consumption is found to be 0.8 V × 30 $\mu$A = 24.0 $\mu$W.

Table [2](#page-10-0) tabulates several previous low power/energy SRAM designs using 28, 40, and 65‐nm CMOS technology nodes in the past years. The energy/access is defined as the average energy dissipated while executing write‐0‐read‐0 and write-1-read-1 operations divided by the system clock rate. The energy/bit, on the other hand, is defined as the energy/access divided by the number of bits per word. The proposed SRAM attained the second lowest energy per access with a value of 32 fJ, second only to 20.6 fJ [[18\]](#page-12-0).

It does, however, demonstrate the lowest energy per bit among all SRAMs designed and implemented using the 40‐nm CMOS technology and measured on silicon. It even has a lower energy per bit compared to one work implemented in more advanced CMOS technology, that is, 28‐nm CMOS technology [\[18\]](#page-12-0). Referring to Figure [28](#page-10-0), the SNMs of the SRAMs in the previous decade are compared. The graph is normalised to the respective voltage supply. As observed, the proposed SRAM's SNM is the closest to 50% of the supply voltage, resulting in good noise immunity.

These observations validate the fact that the proposed power‐gating PMOS devices indeed account for the reduction of the standby power for those cells that are not activated.

<span id="page-9-0"></span>

**FIGURE 25** Measurement setup for the SRAM prototype



**FIGURE 26** Access operation at (a) 2 MHz (BS = 0) and (b) 2 MHz (BS = 1)

Referring to the technology roadmap shown in Figure [29,](#page-11-0) the proposed SRAM achieved the second lowest energy per bit of the last decade compared with previous works. If the energy per bit is normalised by the CMOS process, namely 40-nm (ours) versus 28-nm [[18](#page-12-0)], as well as the PDP reduction, the proposed SRAM is in fact the lowest one. The major reason is the addition of the power gating circuit in the single-ended cell, wherein the standby power is significantly reduced. Meanwhile, the proposed AVD circuit manages to compensate for the access speed loss by generating a higher supply voltage when the access operations are asserted. All of the measurement results substantiate the energy efficiency of the proposed design.

**FIGURE 27** Access operation at (a) 10 MHz (BS = 0) and (b) 10 MHz (BS = 1)

# **4** | **CONCLUDING REMARKS**

This investigation demonstrates a very low‐power SRAM architecture on silicon that has power supply gating that responds to the cell operations. The supply voltage is kept at a lower level for SRAM cells that are not being accessed, which in turn creates a substantial decrease in standby power. Aside from the supply voltage gating, a power delay product reduction circuit is added to the design to further reduce the power dissipation by decreasing the transient time of states. On the other hand, this extra circuit elevates the supply of the read/write circuit to a higher level when the read/write circuit is accessed, in which the delay is significantly reduced. Post‐layout simulations verify the ultra‐low‐power performance, and the physical measurement also showed the expected low power/energy performance. The same design

<span id="page-10-0"></span>**TABLE 2** Performance comparison of current state‐of‐the‐art low‐energy/power SRAM designs

| Year                      | VLSIC 2011 [19] | ISQED 2012 [14] | TCAS-I 2014 [11] | CICC 2015 [20] | TVLSI 2016 [16] | TVLSI 2017 [21] | TCAS-I 2017 [22] |
|---------------------------|-----------------|-----------------|------------------|----------------|-----------------|-----------------|------------------|
| CMOS Tech. (nm)           | 40              | 40              | 40               | 28             | 40              | 65              | 65               |
| Cell                      | 8T              | 8T              | 12T              | 8T             | 5T              | 6T              | 9T               |
| Supply volt. (V)          | 0.5             | 0.6             | 0.35             | 0.7            | 0.8             | 1.2             | 0.35             |
| Verification<sup>a</sup>  | Meas.           | Meas.           | Meas.            | Meas.          | Meas.           | Simu.           | Meas.            |
| SNM (mV)                  | N/A             | 86              | 119              | 171            | 353             | N/A             | N/A              |
| Read PDP (fJ)             | 88              | N/A             | N/A              | 650            | N/A             | 17.5            | N/A              |
| Capacity (kb)             | 512             | 4 + 1           | 4                | 64             | 4 + 1           | 1               | 4                |
| Word length               | 16              | 16              | 16               | 16             | 5               | 32              | 64               |
| Frequency (MHz)           | 6.25            | 10              | 11.5             | 50             | 54              | 100             | 0.741            |
| Energy/access (pJ)        | 8.8             | 2.24            | 1.91             | 0.65           | 0.941           | 2.2             | 0.229            |
| Energy/bit (fJ)           | 550             | 140             | 119.4            | 40.625         | 188.22          | 68.75           | 3.58             |
| Core area (mm²)           | 0.73            | 1.278           | 0.018            | 0.73           | 0.024           | 0.013           | 0.011            |

| Year                      | TCAS-II 2018 [23] | JCSC 2019 [18] | TCAS-II 2020 [24] | TCAS-I 2021 [25] | TCAS-II 2021 [26] | TCAS-II 2021 [27] | This work 2022               |
|---------------------------|-------------------|----------------|-------------------|------------------|-------------------|-------------------|------------------------------|
| CMOS Tech. (nm)           | 65                | 28             | 65                | 55               | 40                | 16                | 40                           |
| Cell                      | 6T                | 6T             | 8T                | 6T               | 6T                | 6T                | 6T                           |
| Supply volt. (V)          | 1.2               | 0.8            | 0.36              | 1.2              | 0.9               | 0.8               | 0.8                          |
| Verification<sup>a</sup>  | Simu.             | Meas.          | Meas.             | Meas.            | Meas.             | Meas.             | Meas.                        |
| SNM (mV)                  | N/A               | 292            | 190               | N/A              | 377               | 504.76            | 412.3                        |
| Read PDP (fJ)             | N/A               | 444.5          | 4454.4            | 233.38           | 47.382            | 2.69              | 2.0592                       |
| Capacity (kb)             | 8                 | 1 + 1          | 32                | 4                | 1                 | 1                 | 1                            |
| Word length               | 32                | 32             | 128               | 32               | 32                | 32                | 32                           |
| Frequency (MHz)           | 20                | 40             | 0.25              | 935              | 200               | 500               | 10 (typ.)<br>15 (max.)       |
| Energy/access (pJ)        | 0.592             | 0.026          | 0.3               | 1.04             | 0.2313            | 0.219             | 0.032                        |
| Energy/bit (fJ)           | 18.5              | 0.6            | 2.34              | 32.5             | 7.23              | 6.8               | 1.0 @ 10 MHz<br>4.3 @ 2 MHz  |
| Core area (mm²)           | 0.019             | 0.025          | 0.015             | 0.018            | 0.01              | 0.02              | 0.02                         |

<sup>a</sup> Simu.: simulations; Meas.: on‐chip measurements.
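The energy‐per‐bit row of Table 2 appears to follow from dividing the energy per access by the word length. The text does not state this relation explicitly, so the sketch below treats it as an assumption and checks it against a few rows copied from the table; the function and variable names are illustrative only.

```python
# Rows copied from Table 2:
# (label, energy per access in pJ, word length in bits, tabulated energy per bit in fJ)
TABLE2_ROWS = [
    ("VLSIC 2011 [19]", 8.8, 16, 550.0),
    ("ISQED 2012 [14]", 2.24, 16, 140.0),
    ("TCAS-I 2017 [22]", 0.229, 64, 3.58),
    ("TCAS-II 2021 [26]", 0.2313, 32, 7.23),
    ("This work @ 10 MHz", 0.032, 32, 1.0),
]


def energy_per_bit_fj(energy_per_access_pj, word_length_bits):
    """Assumed relation: energy/bit = energy/access / word length (1 pJ = 1000 fJ)."""
    return energy_per_access_pj * 1000.0 / word_length_bits


def check_rows(rows, rel_tol=0.01):
    """Return (label, computed fJ, tabulated fJ, agrees) for each row.

    A row "agrees" when the computed value is within rel_tol (relative)
    of the tabulated value, which allows for rounding in the table.
    """
    results = []
    for label, e_access_pj, word_len, tabulated_fj in rows:
        computed = energy_per_bit_fj(e_access_pj, word_len)
        agrees = abs(computed - tabulated_fj) <= rel_tol * tabulated_fj
        results.append((label, computed, tabulated_fj, agrees))
    return results


if __name__ == "__main__":
    for label, computed, tabulated, agrees in check_rows(TABLE2_ROWS):
        print(f"{label:20s} computed {computed:8.3f} fJ, "
              f"tabulated {tabulated:8.3f} fJ, agrees: {agrees}")
```

Under this assumption the listed rows reproduce the tabulated values to within rounding. The JCSC 2019 [18] row is left out on purpose: 0.026 pJ over 32 bits gives about 0.81 fJ rather than the tabulated 0.6 fJ, so that entry may have been derived on a different basis.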



**FIGURE 28** Comparison of SNMs for SRAMs (normalised to V<sub>DD</sub>)

methodology is expected to be applicable to more advanced SRAM technology nodes, for example, 22‐nm or even 16‐nm FinFET nodes.

# **AUTHOR CONTRIBUTION**

Chua‐Chin Wang: Funding acquisition, Visualisation, Formal analysis, Investigation, Methodology, Writing – review & editing. Ralph Gerard B. Sangalang: Formal analysis, Investigation, Methodology, Writing – review & editing. I‐Ting Tseng: Conceptualisation, Methodology, Software, Validation. Yi‐Jen Chiu: Funding acquisition, Visualisation. Yu‐Cheng Lin: Funding acquisition, Visualisation, co-corresponding. Oliver Lexter July A. Jose: Resources, Writing – review & editing.

<span id="page-11-0"></span>

**FIGURE 29** Technology roadmap of energy per bit for recent SRAMs

#### **ACKNOWLEDGEMENT**

The National Science and Technology Council of Taiwan funded this study in part via grants MOST109‐2218‐E‐110‐007, 108‐2218‐E‐110‐011, 108‐2218‐E‐110‐002, 107‐2218‐E‐110‐002, and 110‐2221‐E‐110‐063‐MY2. The authors would like to express their profound appreciation to TSRI (Taiwan Semiconductor Research Institute) in NARL (National Applied Research Laboratories), Taiwan, for providing EDA tool support, fabrication service, and measurement setup.

#### **CONFLICT OF INTEREST STATEMENT**

None.

#### **DATA AVAILABILITY STATEMENT**

The data that support the findings of this study are available from the corresponding author upon reasonable request.

# **PERMISSION TO REPRODUCE MATERIALS FROM OTHER SOURCES**

None.

## **ORCID**

*Chua‐Chin Wang* <https://orcid.org/0000-0002-2426-2879>

*Ralph Gerard B. Sangalang* <https://orcid.org/0000-0002-4120-382X>

#### **REFERENCES**

- 1. Morifuji, E., et al.: Supply and threshold‐voltage trends for scaled logic and SRAM MOSFETs. IEEE Trans. Electron. Dev. 53(6), 1427–1432 (2006). <https://doi.org/10.1109/TED.2006.874752>
- 2. Xu, H., et al.: A current mode sense amplifier with self-compensation circuit for SRAM application. In: Proceedings of the 2013 IEEE 10th International Conference on ASIC; 2013 Oct 28–31; Shenzhen, China, pp. 1–4. IEEE, New York (2013). <https://doi.org/10.1109/ASICON.2013.6812020>
- 3. Do, A.‐T., et al.: Design and sensitivity analysis of a new current‐mode sense amplifier for low-power SRAM. IEEE Trans. VLSI Syst. 19(2), 196–204 (2011). <https://doi.org/10.1109/TVLSI.2009.2033110>
- 4. Kim, D., et al.: A 1.85fW/bit ultra low leakage 10T SRAM with speed compensation scheme. In: Proceedings of the 2011 IEEE International Symposium of Circuits and Systems (ISCAS); 2011 May 15–18; Rio de Janeiro, Brazil, pp. 69–72. IEEE, New York (2011). <https://doi.org/10.1109/ISCAS.2011.5937503>
- 5. Ruixing, L., et al.: Bitline leakage current compensation circuit for high‐performance SRAM design. In: Proceedings of the 2012 IEEE Seventh International Conference on Networking, Architecture, and Storage; 2012 Jun 28–30; Xiamen, China, pp. 109–113. IEEE, New York (2012). <https://doi.org/10.1109/NAS.2012.19>
- 6. Agawa, K., et al.: A bitline leakage compensation scheme for low‐voltage SRAMs. IEEE J. Solid State Circ. 36(5), 726–734 (2001). <https://doi.org/10.1109/4.918909>
- 7. Wang, C.‐C., et al.: 4‐kB 500‐MHz 4‐T CMOS SRAM using low‐VTHN bitline drivers and high‐VTHP latches. IEEE Trans. VLSI Syst. 12(9), 901–909 (2004). <https://doi.org/10.1109/TVLSI.2004.833669>
- 8. Wang, C.‐C., Lee, C.‐L., Lin, W.‐J.: A 4‐kB low‐power SRAM design with negative word‐line scheme. IEEE Trans. Circuits‐I 54(5), 1069–1076 (2007). <https://doi.org/10.1109/TCSI.2006.888767>
- 9. Wang, D.-S., Su, Y.-H., Wang, C.-C.: A readout circuit with cell output slew rate compensation for 5T single-ended 28 nm CMOS SRAM. Microelectron. J. 70, 107–116 (2017). <https://doi.org/10.1016/j.mejo.2017.11.001>
- 10. Chen, S.‐Y., Wang, C.‐C.: Single‐ended disturb‐free 5T loadless SRAM cell using 90 nm CMOS process. In: Proceedings of the 2012 IEEE International Conference on IC Design and Technology; 2012 May 30–Jun 1; Austin, TX, USA, pp. 1–4. IEEE, New York (2012). <https://doi.org/10.1109/ICICDT.2012.6232848>
- 11. Chiu, Y.‐W., et al.: 40nm bit‐interleaving 12T subthreshold SRAM with data‐aware write‐assist. IEEE Trans. Circuits‐I 61(9), 2578–2585 (2014). <https://doi.org/10.1109/TCSI.2014.2332267>
- 12. Reddy, S., Sangalang, R.G.B., Wang, C.‐C.: Sub‐0.2 pJ/access Schmitt trigger‐based 1‐kb 8T SRAM implemented using 40‐nm CMOS process. In: Proceedings of the 2022 International Conference on IC Design and Technology (ICICDT); 2022 Sep 21–23; Hanoi, Vietnam, pp. 24–27. IEEE, New York (2022). <https://doi.org/10.1109/ICICDT56182.2022.9933116>
- 13. Frustaci, F., et al.: Techniques for leakage energy reduction in deep submicrometer cache memories. IEEE Trans. VLSI Syst. 14(11), 1238–1249 (2006). <https://doi.org/10.1109/TVLSI.2006.886397>
- 14. Terada, M., et al.: A 40-nm 256-kb 0.6-V operation half-select resilient 8T SRAM with sequential writing technique enabling 367‐mV VDDmin reduction. In: Proceedings of the 13th International Symposium on Quality Electronic Design (ISQED); 2012 Mar 19–21; Santa Clara, CA, <span id="page-12-0"></span>USA, pp. 489–492. IEEE, New York (2012). <https://doi.org/10.1109/ISQED.2012.6187538>
- 15. Wang, C.‐C., Tseng, I.‐T.: Ultra low power single‐ended 6T SRAM using 40 nm CMOS technology. In: Proceedings of the 2019 International Conference on IC Design and Technology (ICICDT); 2019 Jun 17–19; Suzhou, China, pp. 1–4. IEEE, New York (2019). <https://doi.org/10.1109/ICICDT.2019.8790848>
- 16. Wang, C.-C., et al.: A leakage compensation design for low supply voltage SRAM. IEEE Trans. VLSI Syst. 24(5), 1761–1769 (2016). <https://doi.org/10.1109/TVLSI.2015.2484386>
- 17. Al‐Harbi, S.M., Gupta, S.K.: An efficient methodology for generating optimal and uniform march tests. In: Proceedings of the 19th IEEE VLSI Test Symposium (VTS 2001); 2001 Apr 29–May 3; Marina Del Rey, CA, USA, pp. 231–237. IEEE, New York (2001). <https://doi.org/10.1109/VTS.2001.923444>
- 18. Wang, C.‐C., et al.: A single‐ended 28‐nm CMOS 6T SRAM design with read‐assist path and PDP reduction circuitry. J. Circ. Syst. Comput. 29(6), 2050095 (2020). <https://doi.org/10.1142/S0218126620500954>
- 19. Yoshimoto, S., et al.: A 40‐nm 0.5‐V 20.1‐µW/MHz 8T SRAM with low‐energy disturb mitigation scheme. In: Proceedings of the 2011 Symposium on VLSI Circuits - Digest of Technical Papers; 2011 Jun 15–17; Kyoto, Japan, pp. 72–73. IEEE, New York (2011). <https://ieeexplore.ieee.org/document/5986220>
- 20. Mori, H., et al.: A 298‐fJ/writecycle 650‐fJ/readcycle 8T three‐port SRAM in 28-nm FD-SOI process technology for image processor. In: Proceedings of the 2015 IEEE Custom Integrated Circuits Conference (CICC); 2015 Sep 28–30; San Jose, CA, USA, pp. 1–4. IEEE, New York (2015). <https://doi.org/10.1109/CICC.2015.7338360>
- 21. Lee, J., et al.: A 17.5-fJ/bit energy-efficient analog SRAM for mixed‐signal processing. IEEE Trans. VLSI Syst. 25(10), 2714–2723 (2017). <https://doi.org/10.1109/TVLSI.2017.2664069>
- 22. Shin, K., Choi, W., Park, J.: Half‐select free and bit‐line sharing 9T SRAM for reliable supply voltage scaling. IEEE Trans. Circuits‐I 64(8), 2036–2048 (2017). <https://doi.org/10.1109/TCSI.2017.2691354>
- 23. Surana, N., Mekie, J.: Energy efficient single‐ended 6‐T SRAM for multimedia applications. IEEE Trans. Circuits‐II 66(6), 1023–1027 (2019). <https://doi.org/10.1109/TCSII.2018.2869945>
- 24. Do, A.-T., Zeinolabedin, S.M.A., Kim, T.T.-H.: Energy-efficient data‐aware SRAM design utilizing column‐based data encoding. IEEE Trans. Circuits‐II 67(10), 2154–2158 (2020). <https://doi.org/10.1109/TCSII.2019.2958668>
- 25. Chen, J., et al.: Analysis and optimization strategies toward reliable and high-speed 6T compute SRAM. IEEE Trans. Circuits‐I 68(4), 1520–1531 (2021). <https://doi.org/10.1109/TCSI.2021.3054972>
- 26. Wang, C.‐C., Kuo, C.‐P.: 200‐MHz single‐ended 6T 1‐kb SRAM with 0.2313 pJ energy/access using 40‐nm CMOS logic process. IEEE Trans. Circuits‐II 68(9), 3163–3166 (2021). <https://doi.org/10.1109/TCSII.2021.3091973>
- 27. Wang, C.‐C., Sangalang, R.G.B., Tseng, I.‐T.: A single‐ended low power 16‐nm FinFET 6T SRAM design with PDP reduction circuit. IEEE Trans. Circuits‐II 68(12), 3478–3482 (2021). <https://doi.org/10.1109/TCSII.2021.3123676>

**How to cite this article:** Wang, C.‐C., et al.: A 1.0 fJ energy/bit single‐ended 1 kb 6T SRAM implemented using 40 nm CMOS process. IET Circuits Devices Syst. 1–13 (2023). <https://doi.org/10.1049/cds2.12141>