Fully-integrated photonic tensor core for image convolutions

Convolutions are one of the most critical signal and image processing operations. From spectral analysis to computer vision, convolutional filtering is often related to spatial information processing involving neighbourhood operations. As convolution operations are based around the product of two functions, vectors or matrices, dot products play a key role in the performance of such operations; for example, advanced image processing techniques require fast, dense matrix multiplications that typically take more than 90% of the computational capacity dedicated to solving convolutional neural networks. Silicon photonics has been demonstrated to be an ideal candidate to accelerate information processing involving parallel matrix multiplications. In this work, we experimentally demonstrate a multiwavelength approach with fully integrated modulators, tunable filters as microring resonator weight banks, and a balanced detector to perform matrix multiplications for image convolution operations. We develop a scattering matrix model that matches the experiment to simulate large-scale versions of these photonic systems with which we predict performance and physical constraints, including inter-channel cross-talk and bit resolution.


Introduction
Application-specific integrated circuits (ASICs) for convolutions have been intensively investigated in the past few years due to their significant benefits: parallel and high-speed processing and small footprint [1]. Experimental work on analog photonic ASICs has been successfully demonstrated to perform on-chip convolutions. In recent years, integrated photonics has enabled parallel processing for compact, ultrafast convolution operations in perceptron-based networks [2][3][4][5][6]. Photonic networks implemented with discrete devices have also been shown to achieve high performance for image processing [7][8][9][10]. Due to the reconfigurability of photonic components, convolutional kernels used in those circuits can be re-sized to accommodate tasks that require convolutions, as well as any neural network model based on them.
The high demand for hardware accelerators that can perform convolutions is driving important technological advances in integrated photonics. For embedded systems, incorporating hardware accelerators with a small footprint and low power consumption is the main driving factor [11][12][13]. For instance, augmented reality glasses can use an embedded processor that performs convolutions efficiently to remove blurring or compensate for motion artifacts. Complex convolutional operations would be implemented through convolutional neural networks (CNNs), whose algorithms have been broken down into physical models and implemented by ASICs. Convolutional operations used in CNNs are computationally expensive since they encompass around 99% of the computational capacity to solve most convolution-based deep learning models [14].
Nanotechnology 34 (2023) 395201 (11pp) https://doi.org/10.1088/1361-6528/acde83 Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
By exploiting the parallelism and high bandwidth offered by light, a decrease in latency and an increase in compute density to perform convolutions is expected [15]. These features can be obtained by specifying a suitable architecture that efficiently performs matrix-vector multiplications (MVMs). Algorithms specially designed for parallel hardware platforms such as cuDNN [16] have been proven to work in multiwavelength photonics [2,17,18]. This approach represents operations between images and kernels as the dot product of two vectorized matrices. The parallel features of incoherent photonic platforms allow for very compact on-chip designs that are suitable for both the mobile industry and the embedded system market, which usually require processors with a small footprint. In addition, the reprogrammable capabilities of tunable silicon microring resonator (MRR) weight banks allow for great versatility when performing on-chip MVMs [15,19–22]. In particular, the digital electronics and analog photonics (DEAP) architecture introduced in [18] has the potential to become a highly efficient, parallel, compact solution for convolutions and MVMs in general.
In this work, we present for the first time an experimental demonstration of a fully-integrated photonic tensor core able to perform on-chip MRR-based multiwavelength image convolutions. Our demonstration includes on-chip integration of inputs and weights with MRRs to perform element-wise multiplications and a balanced photodetector (BPD) to perform on-chip summations, as shown in figure 1 by a general parallel (D × M) schematic for arbitrary kernel size (R × R = M) and number of input channels (D). We show how to perform on-chip convolutions with 2 × 2 kernel sizes. The image and kernel pixels are encoded as MRR transmission values through voltages applied to in-resonator photoconductive heaters (IRPHs) [23,24] embedded in the MRRs. The IRPH-MRRs allow simultaneous weight tuning, monitoring, and stabilization. As discussed in [18], this architecture can produce convolved pixels on hundred-picosecond timescales with PN-junction-based microring modulators (MRMs) based on carrier depletion effects (instead of the thermo-optic effect demonstrated here). To explore the limitations of this tensor core for larger kernels, we introduce a complete theoretical model based on S-matrix theory to model the physical properties of MRRs and their interactions. This simulator allows for the study of the operating conditions of the DEAP architecture, which include the number of wavelengths within a free spectral range (FSR), bits of resolution, encoding protocols, and inter-channel cross-talk (which was not accounted for in the original simulation described in [18]). We will show how to program DEAP cores made of IRPH-MRRs with techniques that can be extrapolated to architectures with other MRRs and active elements. Experimental results will be contrasted with S-matrix simulations for validation purposes.

A 4-DEAP core architecture implemented on chip
Figure 2(a) shows a 4-DEAP core circuit comprising a set of four cascaded all-pass MRRs for encoding inputs and another set of four cascaded add-drop MRRs encoding weights/kernels. The pairs of input and weight MRRs on the same resonance wavelength perform parallel element-wise multiplications that are summed up by a BPD connected to the thru and drop ports of the weight MRRs, implementing positive and negative multiplicative weights. The tensor core is fabricated on a standard silicon-on-insulator (SOI) platform with a silicon thickness of 220 nm, a buried oxide thickness of 2 μm, and waveguides 500 nm wide. Each MRR has an embedded N-doped heater (the IRPH), designed with two N++ doped layers on the sides of a silicon waveguide on top of SiO2 [22][23][24]. On top of the N-doped layers sit aluminum contacts, through which the refractive index of the waveguide is thermally tuned. Each set of MRRs (inputs and weights separately) has a set of radii defined to avoid resonance overlap. The MRRs are physically spaced 50 μm from each other on the chip. Figure 2(b) shows a schematic of the experimental setup, which contains four tunable lasers (Pure Photonics PPCL600 micro-ITLA) used to feed the circuit through polarization controllers, a 4 × 1 multiplexer (MUX), and an erbium-doped fibre amplifier that compensates for the coupling losses to the chip. A rack of source measure units (SMUs; Keithley 2606B) composed of eight channels drives the heaters on the MRRs. Another three SMUs (Keithley 2400) are used to bias the on-chip BPD (which has a responsivity of 1 A/W) and measure photocurrents at the drop, thru, and subtraction (sub) ports.
To obtain the spectrum of the MRRs at the thru and the drop ports simultaneously, the wavelength of one (out of the four) of the tunable lasers is swept and detected at the respective ports (thru and drop) of the BPD. For this sweep, the MRRs are not actuated with any heater currents. The spectrum can be seen as a set of solid curves in figure 2(c). At the drop port, four resonances are obtained corresponding to the four weights. At the thru port, we count eight resonances corresponding to the thru ports of the inputs and the weights. The resonances are at different wavelengths since the radii of all MRRs are slightly different by default. This weight bank's estimated average experimental parameters consist of linewidth = 0.30 nm, FSR = 12.32 nm, finesse = 39.92 and quality factor = 5034.47.
As observed, the initial spectra of the four input MRRs are slightly off-resonance with their corresponding weight MRRs, which are on-resonance with wavelengths {λ_i} (with i = 1, 2, 3, 4). Alignment can be achieved by applying a voltage to each pair of MRRs until the eight peaks displayed by the thru port curve become four. Figure 2(d) shows the IV curves of the photodetectors contained in the BPD. From these curves, we find that bias voltages of {−1, 1} V are suitable for the thru and drop photodetectors, respectively.
Thermal fluctuations can negatively impact the performance of our circuit. Since we use MRRs to perform matrix-to-vector multiplications, significant thermal noise can take the MRRs off-resonance, leading to significant errors in the circuit's operations. To avoid these scenarios, we added a thermal controller to the DEAP circuit that compensates for slowly varying thermal fluctuations. For more accurate thermal control, a feedback control system can be connected to the N-doped heater of each MRR, as shown in [23][24][25]. For large-scale circuits, feedback control is typically implemented in CMOS, integrated with the photonics either monolithically or heterogeneously (flip-chip). The thermal tuning efficiency of MRRs with N-doped heaters is 0.25 nm mW⁻¹, so tuning an MRR on and off resonance consumes about 1 mW per MRR. The BPD's power consumption is 1 μW, which is negligible compared to the heaters' consumption. Since our circuit comprises eight MRRs, the power required to implement convolution operations with the thermo-optic effect is in the range (0, 8) mW. Using CMOS-compatible MRMs with PN junctions, each MRM will consume 0.05 mW or less. The power efficiency of the thermally tuned system is around 200 kMAC/s / 8 mJ/s = 25 MMAC/J, i.e. 40 nJ per MAC. With today's technology, i.e. MRMs with PN junctions, the same ratio gives up to 200 GMAC/s / 0.05 mJ/s = 4 PMAC/J, i.e. 0.25 fJ per MAC.
Our photonic circuit is a low-speed proof-of-concept describing all details that would lead to eventual high-speed applications. The modulation speed of the N-doped heater is around 50 kbps. CMOS-compatible MRMs typically reach >50 Gbps using standard reverse-biased PN junctions. Using the protocol described in this manuscript, we can implement high-speed operations in a DEAP core with MRMs with PN junctions [26]. The number of MAC operations per second is 4 MAC/20 μs = 200 kMAC/s for the thermally-tuned DEAP core presented in this work. Using an MRM with a modulation speed of, e.g. 50 Gbps, the number of MAC operations per second is 4 MAC/20 ps = 200 GMAC/s, which will be available in future generations. In addition, as demonstrated in [18], a DEAP core that uses MRMs with modulation speeds of 128 Gbps [26] can perform convolutions between 1.4 and 7.0 times faster than the mean GPU runtime while using 63% less energy. Based on the vectorized format [16] we are using to implement convolutions, a DEAP core can achieve remarkable savings in energy and convolution runtime compared to DeepBench benchmarks.
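The throughput and energy figures quoted above follow from two divisions, which can be checked with a short sketch. The function names are illustrative and not part of the original work; the 0.05 mW value is the per-MRM figure quoted in the text, so a whole-core estimate would scale with the number of MRMs.

```python
# Sketch of the throughput and energy arithmetic, using only numbers quoted in the text.
# Function names are hypothetical helpers introduced for illustration.

def mac_rate(macs_per_step: int, step_time_s: float) -> float:
    """MAC operations per second when `macs_per_step` MACs finish every `step_time_s`."""
    return macs_per_step / step_time_s


def energy_per_mac(power_w: float, rate_mac_per_s: float) -> float:
    """Energy per MAC in joules: power (J/s) divided by throughput (MAC/s)."""
    return power_w / rate_mac_per_s


# Thermally tuned 4-DEAP core: 4 MACs every 20 us, up to 8 mW of heater power.
thermal_rate = mac_rate(4, 20e-6)
thermal_energy = energy_per_mac(8e-3, thermal_rate)

# PN-junction MRMs: 4 MACs every 20 ps, 0.05 mW (per-MRM figure as quoted).
pn_rate = mac_rate(4, 20e-12)
pn_energy = energy_per_mac(0.05e-3, pn_rate)

print(f"thermal: {thermal_rate:.3g} MAC/s, {thermal_energy:.3g} J/MAC")
print(f"PN MRM : {pn_rate:.3g} MAC/s, {pn_energy:.3g} J/MAC")
```

Dividing throughput by power instead gives the reciprocal quantity, MAC per joule.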

Pipeline for photonic convolutions
In this section, we show how DEAP can perform simple image convolutions. We compare the experimental results with S-matrix-based simulations. For a given image $\mathcal{A}$ and kernel $\mathcal{F}$ with dimensionality Z × Z and R × R, respectively, the output of a convolved pixel is given by

$$\mathcal{O}_{i,j} = \sum_{k=1}^{R} \sum_{l=1}^{R} \mathcal{F}_{k,l}\, \mathcal{A}_{i+k-1,\, j+l-1},$$

where the input image is symmetric, and the stride parameter is equal to one. The expression for the general case can be found in [18]. Based on the cuDNN algorithm for parallel hardware platforms [16], we represent operations between images and kernels as dot products of two vectorized matrices. This method allows us to use the same DEAP architecture to carry out the full convolution by performing dot products of two vectors. The output of such an operation is constructed using matrix slicing notation on the image matrix,

$$\mathcal{O}_{i,j} = \mathrm{vec}\left(\mathcal{A}_{i:i+R-1,\; j:j+R-1}\right) \cdot \mathrm{vec}\left(\mathcal{F}\right).$$

In practice, the set of operations to be done on the image matrix can be collected in a new matrix $\mathcal{A}_{\mathrm{new}}$ with dimensions $(Z-R+1)^2 \times R^2$, and the kernel is defined as a vector $\mathcal{F}_{\mathrm{new}}$ with $R^2$ elements. Therefore, each row of $\mathcal{A}_{\mathrm{new}}$ is multiplied by the kernel $\mathcal{F}_{\mathrm{new}}$, given that each of them is composed of $R^2$ elements. Such a process can be performed either in parallel or sequentially. The fully parallel approach would require $(Z-R+1)^2$ cells made of $R^2$-DEAP cores, where each core will implement the multiplication of a given row of $\mathcal{A}_{\mathrm{new}}$ by the vector which defines the kernel $\mathcal{F}_{\mathrm{new}}$. The sequential approach uses one cell made of an $R^2$-DEAP core for a total of $(Z-R+1)^2$ iterations. In each iteration, the inputs to the DEAP architecture need to be updated to implement every row of $\mathcal{A}_{\mathrm{new}}$.

Figure 2 (partial caption). (c) Here the wavelength of a Pure Photonics PPCL600 micro-ITLA tunable laser was swept, and the monitored photocurrent was gathered at the BPD to get the solid curves. (d) Balanced photodetector IV curves.
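The vectorized (cuDNN-style) formulation can be illustrated with a minimal NumPy sketch, assuming a square image, a square kernel, and stride one. The function names `im2col` and `convolve_vectorized` are illustrative choices, not names from the original work; each row of the image matrix corresponds to one iteration of an R²-DEAP core in the sequential approach.

```python
import numpy as np


def im2col(image: np.ndarray, R: int) -> np.ndarray:
    """Collect every R x R patch of a Z x Z image as one row of a (Z-R+1)^2 x R^2 matrix."""
    Z = image.shape[0]
    out = Z - R + 1
    rows = np.empty((out * out, R * R), dtype=float)
    for i in range(out):
        for j in range(out):
            # Matrix slicing followed by flattening gives one vectorized patch.
            rows[i * out + j] = image[i:i + R, j:j + R].ravel()
    return rows


def convolve_vectorized(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Stride-1 convolution written as dot products of vectorized patches and kernel."""
    R = kernel.shape[0]
    out = image.shape[0] - R + 1
    image_rows = im2col(image, R)      # one row per output pixel
    kernel_vec = kernel.ravel()        # kernel as a vector with R^2 elements
    return (image_rows @ kernel_vec).reshape(out, out)


# Example: an 8 x 8 image and a 2 x 2 averaging kernel give 49 dot products of length 4.
image = np.arange(64, dtype=float).reshape(8, 8)
kernel = np.full((2, 2), 0.25)
image_rows = im2col(image, 2)
result = convolve_vectorized(image, kernel)
print(image_rows.shape, result.shape)
```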

Experimental image convolutions
For DEAP architectures, there are limits to the amount of information that can be represented for either convolutions or MVMs. The first limitation is related to the FSR of the MRRs, as it imposes a constraint on the number of channels that can be cascaded in series due to the optical cross-talk between these components. Since our encoding methods are index-based, tuning MRRs on and off resonance to implement numbers in the interval (0, 1) requires shifting the resonance condition to the right, from minimum to maximum transmission. Such an encoding protocol defines a one-to-one mapping between a given digital number to be represented in analog, the refractive index of the MRR, and the transmission through it in the optical domain [27].
A given DEAP architecture is limited in size by the physical properties of the MRRs that compose it. The finesse is estimated to find the absolute theoretical maximum number of wavelength-division multiplexed channels that can be defined within an FSR. The finesse can be calculated as the ratio of the FSR to the resonance width (the full width at half maximum, FWHM): finesse = FSR/FWHM. Once the number of channels N has been defined, we distribute these channels evenly within an FSR. To do so, we tune the refractive index of MRRs that have the same properties, placing their transmission profiles at different locations in the optical spectrum. The MRRs must be designed with the same properties, making them theoretically identical, so resonance overlap is expected for identical MRRs. In practice, this condition cannot be fulfilled because of fabrication variability. To account for such variability, the refractive index of the MRRs can be tuned to achieve the resonance overlap condition.
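As a rough sketch of this estimate, the finesse and an even channel distribution can be computed from the average FSR and linewidth reported for the weight bank. The starting wavelength of 1550 nm is a hypothetical value chosen only for illustration, and the ratio of the two averages need not equal the average finesse reported in the text.

```python
import math


def finesse(fsr_nm: float, fwhm_nm: float) -> float:
    """Finesse as the ratio of free spectral range to resonance width (FWHM)."""
    return fsr_nm / fwhm_nm


def max_channels(fsr_nm: float, fwhm_nm: float) -> int:
    """Absolute theoretical upper bound on channels within one FSR."""
    return math.floor(finesse(fsr_nm, fwhm_nm))


def channel_centres(start_nm: float, fsr_nm: float, n_channels: int) -> list:
    """Centre wavelengths of n_channels distributed evenly within one FSR."""
    spacing = fsr_nm / n_channels
    return [start_nm + k * spacing for k in range(n_channels)]


FSR_NM = 12.32    # average FSR reported for the weight bank
FWHM_NM = 0.30    # average linewidth reported for the weight bank
START_NM = 1550.0 # HYPOTHETICAL first channel wavelength, for illustration only

upper_bound = max_channels(FSR_NM, FWHM_NM)
centres = channel_centres(START_NM, FSR_NM, 4)
print(upper_bound, [round(c, 2) for c in centres])
```

This bound ignores cross-talk between the tails of neighbouring resonances, so the usable number of channels is lower in practice.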
DEAP cores can be used to perform analog convolutions and MVMs in general. For this, each device must be tuned to implement such operations. There are two ways of tuning MRRs to represent numbers in the analog domain: (i) absorption [27] and (ii) index methods [22,24,28]. Absorptive tuning consists of changing the critical coupling condition of the MRR by controlling its absorption, which modulates the transmission of light through the MRR. On the other hand, index methods involve tuning MRRs on- and off-resonance by modifying the effective refractive index $n_{\mathrm{eff}}$, leading to changes in their transmission profiles. Here, we employ index tuning to encode information to be processed by DEAP, and we evaluate the number of DEAP cores we can use before optical cross-talk appears.
Let us start with a simple example to calibrate our experiment performed with our on-chip DEAP system. Given that our experimental setup comprises eight MRRs (four to implement inputs and four to implement kernel values), the task to be solved is based on image processing using a 2 × 2 kernel. For a proof of concept, we will use an 8 × 8 gray-level input image (see figure 3(a)) so that the number of iterations required to perform the convolutions is kept low. A total of 49 iterations must be run to process all 4-element vectors that compose the 49 × 4 image matrix $\mathcal{A}_{\mathrm{new}}$ shown in figure 3(b). The proposed task for this proof of concept is Gaussian blur. Therefore, a 2 × 2 kernel (see figure 3(c)) needs to be represented as a vector $\mathcal{F}_{\mathrm{new}}$ with four elements to program the 4-DEAP core.
To encode information for the convolution, we created look-up tables following the protocols stated in the previous section for both the experimental setup and the S-matrix simulation. The experimental look-up tables were built with voltage values in (0, 2.5) V, a range over which no inter-channel cross-talk was experienced in the current experiment. After this step, the digital data from the image and kernel are mapped to refractive index values for the simulation and voltage values for the experiment. Although the input image has 8 × 8 pixels, there are subsets of pixels with the same intensities. In this case, the entire image can be modeled with only seven intensity values, each in one-to-one correspondence with a transmission value in the optical domain. Since the convolution task requires encoding up to seven values, the resolution required from our devices for the image is only $\log_2 7 \approx 2.8$ bits. Likewise, the 2 × 2 kernel has sets of pixels whose intensities are repeated, so the resolution required for the kernel is only $\log_2 3 \approx 1.58$ bits. The resolution of the chip is set by the resolution of the MRRs, as this is the only way information can be encoded into the optical domain. The bit resolution of the MRRs used in this circuit matched the 4 bits of resolution published in [29]. The Keithley SMUs have a 14-bit voltage resolution and therefore do not pose a bottleneck for the system. The results of this experiment are shown in figure 3(d), where the original image has been successfully transformed into a blurry picture.
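A look-up table of this kind can be sketched by inverting a calibration curve of transmission versus heater voltage. The calibration below is synthetic: it assumes a Lorentzian notch, a resonance shift proportional to V², and a hypothetical maximum shift of 1 nm at 2.5 V. None of these assumptions come from the measured device, and a real table would be built from measured data instead.

```python
import numpy as np

FWHM_NM = 0.30       # linewidth reported for the weight bank
V_MAX = 2.5          # upper end of the voltage range used for the look-up tables
MAX_SHIFT_NM = 1.0   # HYPOTHETICAL resonance shift at V_MAX (not a measured value)


def synthetic_calibration(num_points: int = 251):
    """Synthetic voltage-to-transmission curve (ASSUMED Lorentzian notch, shift ~ V^2)."""
    voltage = np.linspace(0.0, V_MAX, num_points)
    shift = MAX_SHIFT_NM * (voltage / V_MAX) ** 2
    transmission = 1.0 - 1.0 / (1.0 + (2.0 * shift / FWHM_NM) ** 2)
    return voltage, transmission


def build_lut(levels, voltage, transmission):
    """Voltage needed for each target transmission level, by inverting the calibration."""
    if np.any(np.diff(transmission) < 0):
        raise ValueError("calibration must be monotonically increasing to be invertible")
    return np.interp(levels, transmission, voltage)


voltage, transmission = synthetic_calibration()
levels = np.linspace(0.0, 0.9, 7)            # seven gray levels, as in the example image
lut_voltages = build_lut(levels, voltage, transmission)
bits_needed = np.log2(len(levels))           # resolution needed to distinguish the levels

for level, volts in zip(levels, lut_voltages):
    print(f"level {level:.2f} -> {volts:.3f} V")
```

The monotonicity check reflects the one-to-one mapping between digital value, refractive index, and transmission that the encoding protocol relies on.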
The described methodology to run image convolutions experimentally is used to perform larger convolutions in practical applications. In this work, we focus on edge detection and color inversion tasks. The difference in performance between both tasks lies in the implementation of the filter. For color inversion in black and white pictures, a 2 × 2 kernel with all elements equal to zero except for one negative element can be used. For edge detection, a larger kernel (e.g. 3 × 3) usually performs better. Larger kernels such as 3 × 3 or 5 × 5 are ideal for more complex convolution operations and CNNs [18]. Even though kernel sizes are usually found empirically, kernels of size 3 × 3 or smaller are often picked since it takes less time to train CNNs based on them. In contrast, larger kernels can lead to more significant generalization errors since they increase the model complexity by adding trainable parameters [30]. Typically, odd-size filters are preferred as they maintain the model's symmetry. For deep CNNs, even-size filters such as 2 × 2 are avoided due to the inability to interpolate the center value of the grid [31]. More MRRs can be added to the DEAP architecture for deep learning tasks. The flexibility of the DEAP circuit allows for implementing the kernel size necessary to enable more complex applications.
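The two kinds of filter can be illustrated digitally with a short NumPy sketch. The color inversion kernel follows the description in the text (a single negative element); the 3 × 3 edge kernel is a common Laplacian-style choice used here as an assumption, since the text does not list the coefficients used in the experiment.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def apply_kernel(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Stride-1 convolution: elementwise product of each patch with the kernel, then sum."""
    windows = sliding_window_view(image, kernel.shape)  # shape (out, out, R, R)
    return np.einsum("ijkl,kl->ij", windows, kernel)


# 2 x 2 kernel with one negative element: each output is the negated top-left pixel.
invert_kernel = np.array([[-1.0, 0.0],
                          [0.0, 0.0]])

# ASSUMED Laplacian-style 3 x 3 edge kernel (illustrative, not taken from the experiment).
edge_kernel = np.array([[-1.0, -1.0, -1.0],
                        [-1.0, 8.0, -1.0],
                        [-1.0, -1.0, -1.0]])

rng = np.random.default_rng(0)
image = rng.random((8, 8))                    # pixel values normalized to (0, 1)
inverted = apply_kernel(image, invert_kernel)
edges_of_flat = apply_kernel(np.full((8, 8), 0.5), edge_kernel)
print(inverted.shape, edges_of_flat.shape)
```

Both kernels contain negative coefficients, which is why the balanced detection scheme with signed weights is needed.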

The S-matrix model
Reflective and transmission properties of the DEAP architecture can be explicitly modeled through S-matrix theory. Here, we derive analytical expressions for the interaction between the incident, reflected, and transmitted electric fields at the ports of coupled MRRs by following the geometry of the devices. The parameters of the i-th MRR include the power attenuation coefficient $\alpha_i$, the round-trip length $L_i = 2\pi R_i$, and the radius $R_i$ [19,33,34].
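For orientation, the widely used closed-form power transmission of a single add-drop ring can be sketched as below. This is the standard textbook expression for one isolated ring, not the authors' coupled S-matrix model, and the numerical values (effective index, radius, self-coupling coefficients, resonance order) are hypothetical.

```python
import numpy as np


def add_drop_transmission(wavelength_um, n_eff, radius_um, alpha_per_um, r1, r2):
    """Thru- and drop-port power transmission of a single add-drop microring.

    r1, r2 are the self-coupling coefficients of the two couplers and
    alpha_per_um is the power attenuation coefficient.
    """
    length = 2.0 * np.pi * radius_um                     # round-trip length L = 2*pi*R
    phi = 2.0 * np.pi * n_eff * length / wavelength_um   # round-trip phase
    a = np.exp(-alpha_per_um * length / 2.0)             # round-trip amplitude transmission
    denom = 1.0 - 2.0 * r1 * r2 * a * np.cos(phi) + (r1 * r2 * a) ** 2
    thru = (r2 ** 2 * a ** 2 - 2.0 * r1 * r2 * a * np.cos(phi) + r1 ** 2) / denom
    drop = (1.0 - r1 ** 2) * (1.0 - r2 ** 2) * a / denom
    return thru, drop


# HYPOTHETICAL parameters, for illustration only.
N_EFF, RADIUS_UM, ORDER = 3.505, 10.0, 142
resonance_um = N_EFF * 2.0 * np.pi * RADIUS_UM / ORDER   # wavelength where phi = 2*pi*ORDER

thru_res, drop_res = add_drop_transmission(resonance_um, N_EFF, RADIUS_UM, 0.0, 0.95, 0.95)
wavelengths = np.linspace(1.54, 1.56, 2001)
thru, drop = add_drop_transmission(wavelengths, N_EFF, RADIUS_UM, 0.0, 0.95, 0.90)
print(thru_res, drop_res)
```

In the lossless case the two ports share all of the input power, which is a convenient sanity check for any implementation.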
As illustrated in figure 5(b), the DEAP device has a matrix representation (S-matrix). The specific parameters were chosen to match the on-chip architecture described in this manuscript. The DEAP core also consists of a BPD, i.e. two photodetectors connected in series in a push-pull configuration. In practice, the drop and thru ports of the MRRs are connected to each photodetector, resulting in the output at the BPD. In this simulator, the action of a BPD is represented by the subtraction of the drop and thru port transmissions, $T_D - T_T$. As shown in [18], input values can only be defined in the interval (0, 1) and weight values in (−1, 1). These conditions fulfill the requirements for image convolutions, as positive numbers in (0, 255) represent images (normalized to (0, 1)), and kernel coefficients are defined in the set of real numbers.
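The role of the balanced detection can be sketched numerically. The example below assumes an idealized lossless weight ring, so that drop and thru transmissions sum to one and a signed weight w maps to T_D = (1 + w)/2 and T_T = (1 − w)/2; this idealization and the function names are illustrative assumptions rather than the calibrated behaviour of the chip.

```python
import numpy as np


def normalise_pixels(pixels_8bit) -> np.ndarray:
    """Map 8-bit gray levels in [0, 255] to input values in [0, 1]."""
    return np.asarray(pixels_8bit, dtype=float) / 255.0


def weight_to_ports(weights):
    """Split signed weights in [-1, 1] into drop and thru transmissions (ASSUMED lossless)."""
    w = np.asarray(weights, dtype=float)
    if np.any(np.abs(w) > 1.0):
        raise ValueError("weights must lie in [-1, 1]")
    return (1.0 + w) / 2.0, (1.0 - w) / 2.0


def bpd_output(inputs: np.ndarray, weights) -> float:
    """Balanced detection: summed drop-port power minus summed thru-port power."""
    t_drop, t_thru = weight_to_ports(weights)
    return float(np.sum(inputs * t_drop) - np.sum(inputs * t_thru))


pixels = normalise_pixels([0, 64, 128, 255])     # one 4-element input vector
weights = np.array([0.5, -1.0, 0.25, 0.0])       # one vectorized 2 x 2 kernel
output = bpd_output(pixels, weights)
print(output, float(np.dot(pixels, weights)))
```

Under this assumption the subtraction at the detector reproduces the signed dot product directly.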

M DEAP cores
An architecture designed to implement M DEAP cores is composed of N = 2M MRRs. The thru port of the entire device is modeled by the S-matrix coefficient $S_{11}$ and, with K ≡ N/2 + 1, the drop port by $S_{12}$, so that the output of the DEAP architecture is $T_D - T_T = |S_{12}|^2 - |S_{11}|^2$. The intermediate steps leading to these expressions are given in the appendix.
For M = 4 and the parameters chosen for this S-matrix model, we found that the simulation (dotted curves in figure 6(a)) is comparable to the experiment (solid curves in figure 6(a)). The difference between the simulated and experimental curves is attributed to fabrication imperfections and the low yield of the chip. Therefore, predictions made by the S-matrix simulator are valid for the on-chip DEAP architecture. As seen in figure 6(b) for the general M-DEAP core architecture, inputs $\{I_i\}$ and weights $\{W_i\}$ within a core i ∈ {1, 2, …, M} are considered on-resonance with each other. To evenly distribute those M core channels along the FSR, the effective index $n_{\mathrm{eff}}^{(i)}$ of each core must be specified. Such a distribution can be achieved by spacing the cores evenly from one another, with each MRR tuned within its own refractive index interval (figure 6(d)). The length of such an interval is key for the programming of the whole system, as it prevents cross-talk-related issues between adjacent channels when tuning MRRs on- and off-resonance. Inter-channel cross-talk typically occurs when the tails of resonant transmission signals interact with one another in the optical spectrum. This case is non-ideal because cross-talk invalidates channel independence, leading to disturbances between channels when encoding. Beyond the range (3.505, 3.5055), we can expect inter-channel cross-talk for M = 19 DEAP cores cascaded in series. These issues can be dealt with by decreasing the number of cascaded MRRs or by increasing the quality factor of the MRRs, so that the tails of the optical transmission do not overlap.
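A back-of-the-envelope view of why cross-talk grows with the number of cores can be obtained by approximating each resonance as a Lorentzian and evaluating its tail at the position of the neighbouring channel. This Lorentzian approximation is an assumption made for illustration; it does not replace the S-matrix model, and it ignores the additional shift needed to tune channels on- and off-resonance.

```python
def lorentzian(detuning_nm: float, fwhm_nm: float) -> float:
    """Normalized Lorentzian line shape: 1 on resonance, 0.5 at half the FWHM."""
    return 1.0 / (1.0 + (2.0 * detuning_nm / fwhm_nm) ** 2)


def neighbour_leakage(fsr_nm: float, fwhm_nm: float, n_cores: int) -> float:
    """Tail of one resonance at the adjacent channel when n_cores share one FSR evenly."""
    spacing = fsr_nm / n_cores
    return lorentzian(spacing, fwhm_nm)


FSR_NM = 12.32   # average FSR reported for the weight bank
FWHM_NM = 0.30   # average linewidth reported for the weight bank

leak_4 = neighbour_leakage(FSR_NM, FWHM_NM, 4)
leak_19 = neighbour_leakage(FSR_NM, FWHM_NM, 19)
print(f"M = 4 : {leak_4:.4f}")
print(f"M = 19: {leak_19:.4f}")
```

A narrower linewidth (higher quality factor) lowers the leakage at a fixed spacing, in line with the mitigation described above.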
The number of different inputs and weights that we can encode is specified by the modulation depth of the device, the length of the interval where inputs and weights are encoded, and the number of values that can be defined within such an interval. To determine the number of refractive index values through which digital numbers can be encoded, we divide the refractive index range in H bins with width δ as shown in figure 6(d).

Resolution
We define the bit resolution of the MRRs as $\log_2 H$. Figure 6(e) shows the number of bits versus the bin size δ for different numbers of MRRs (2M) cascaded in series in DEAP. These results show that the bit resolution decreases with the number of MRRs cascaded in series for a fixed δ. To incorporate more MRRs within an FSR without experiencing inter-channel cross-talk, δ must decrease. For instance, for more than 34 MRRs cascaded in series, we would need to encode numbers with $\delta \sim 10^{-7}$ to achieve 10 bits of resolution, which translates into 1024 different input and weight values that can be defined. Although resolution above 9 bits [25] has yet to be experimentally demonstrated in MRR weight banks (due to limitations of MRR control algorithms), here we show the significance of achieving such a high resolution to encode information. Although low-resolution (8 bits or less) neural networks have been shown to perform cognitive tasks [35,36], many applications in machine learning still rely on the highest resolution possible to achieve standard performances. This case is especially true for training purposes.
The MRRs can represent a smaller set of numbers for a larger bin size. This last case is especially suited for experimental setups performing the computation in noisy environments. The purpose of defining bin sizes to encode information is to account for random drifts on the devices caused by temperature fluctuations and noise. Therefore, the information will be represented in analog architectures as an interval of values just like binary numbers '1' and '0' are represented by voltages in (2.7, 5.0) V and (0.0, 0.8) V in actual transistor-transistor logic gate circuits, respectively.
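The binning idea can be sketched as follows: a refractive index interval of a given width is divided into H bins of width δ, the bit resolution is log2(H), and a value is recovered correctly as long as drift keeps the index inside its bin. The interval (3.505, 3.5055) is taken from the text; the choice of 16 bins and the drift of 0.4δ are hypothetical values for illustration.

```python
import math


def num_bins(index_range: float, delta: float) -> int:
    """Number of bins H of width delta that fit in the encoding interval."""
    return math.floor(index_range / delta + 1e-9)  # small guard against float rounding


def bit_resolution(index_range: float, delta: float) -> float:
    """Bit resolution log2(H) for a given interval length and bin width."""
    return math.log2(num_bins(index_range, delta))


def encode(level: int, n_min: float, delta: float) -> float:
    """Refractive index at the centre of the bin that represents `level`."""
    return n_min + (level + 0.5) * delta


def decode(n_eff: float, n_min: float, delta: float, h: int) -> int:
    """Bin index recovered from a (possibly drifted) refractive index."""
    return max(0, min(int((n_eff - n_min) / delta), h - 1))


N_MIN, N_MAX = 3.505, 3.5055          # interval quoted in the text
H = 16                                # HYPOTHETICAL number of bins (4 bits)
DELTA = (N_MAX - N_MIN) / H

bits = bit_resolution(N_MAX - N_MIN, DELTA)
drifted = encode(5, N_MIN, DELTA) + 0.4 * DELTA   # HYPOTHETICAL drift below half a bin
recovered = decode(drifted, N_MIN, DELTA, H)
print(bits, recovered)
```

Larger bins tolerate larger drifts at the cost of fewer representable values, which is the trade-off described above.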

Conclusions
We have introduced an experimental demonstration of a DEAP core and a physical-level simulator based on scattering theory that accounts for optical cross-talk, which was not accounted for in the original simulation described in [18]. The simulator considers the flow of the electric field through the devices and generates a transfer-level matrix that describes the system's behavior. The coefficients of the scattering matrix allow us to examine the information flow at the thru and drop ports as a function of round-trip loss, self-coupling coefficients, refractive index, and radius. Our S-matrix simulator can help predict how a DEAP architecture performs regarding resolution and inter-channel cross-talk when scaling up a given system. As this is a physical transfer-level simulator, it is platform agnostic, so it can be implemented in Python, MATLAB, or any other efficient package for scientific computing.
The S-matrix simulator was tested on simple image convolutions. The results were compared with an experimental demonstration of an on-chip 4-DEAP core and a digital simulation. Experimental vector operations could be successfully performed using two sets of on-chip MRRs to encode image and kernel values. Performance was qualitatively compared when solving color inversion and Gaussian blur convolution tasks, showing that specific critical characteristics in the definition of both convolutions are preserved and that they coincide in both S-matrix simulations and experiments. An on-chip DEAP architecture with such features is limited to 18 cores for larger-scale implementations after considering inter-channel cross-talk and FSR constraints. In practice, the resolution of the δ-binning process will depend on the resolution of the device that we use to sweep the voltage. For instance, the Keithley 2600-series SMUs have a programming resolution of 50 μV for a range of 2 V, with which we could encode 20 thousand values, so almost 14 bits of resolution. Considering that the experimental look-up tables were built with voltage values in (0, 2.5) V, we could achieve such a high resolution without experiencing inter-channel cross-talk for the current experiment. However, drifts caused by noise and temperature fluctuations might impact the theoretical resolution that can be achieved. In this case, control systems dealing with those drifts will help relax constraints and limitations.

Acknowledgments
This work was partly supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants Program.

Data availability statement
The data cannot be made publicly available upon publication because no suitable repository exists for hosting data in this field of study. The data that support the findings of this study are available upon reasonable request from the authors.

Equations (9) and (10) can be used to estimate S-matrix coefficients where we get a spectrum $\{T_T, T_D\}$ with two resonance peaks within an FSR, as shown in figure 4.