Research in methodologies for modelling the oral cavity

The paper aims to explore the current state of understanding surrounding in silico oral modelling. This involves exploring methodologies, technologies and approaches pertaining to the modelling of the whole oral cavity, covering both internally and externally visible structures that may be relevant or appropriate to oral actions. Such a model could be referred to as a ‘complete model’, which includes consideration of a full set of facial features (i.e. not only the mouth) as well as synergistic stimuli such as audio and facial thermal data. 3D modelling technologies capable of accurately and efficiently capturing a complete representation of the mouth for an individual have broad applications in the study of oral actions, due to their cost-effectiveness and time efficiency. This review delves into the field of clinical phonetics to classify oral actions pertaining to both speech and non-speech movements, identifying how the various vocal organs play a role in the articulatory and masticatory processes. Vitally, it provides a summation of 12 articulatory recording methods, forming a tool to be used by researchers in identifying which method of recording is appropriate for their work. After addressing the cost and resource-intensive limitations of existing methods, a new system of modelling is proposed that leverages external to internal correlation modelling techniques to create more efficient models of the oral cavity. The vision is that the outcomes will be applicable to a broad spectrum of oral functions related to physiology, health and wellbeing, including speech, oral processing of foods as well as dental health. The applications may span from speech correction to designing foods for the ageing population, whilst in the dental field we would be able to gain information about a patient’s oral actions that would become part of creating a personalised dental treatment plan.


Introduction
Technologies capable of creating 3-dimensional facial models (based on some form of input data such as videos or pictures) are often limited to only the external view, seldom providing manipulable cross-sections with observable internal structures. Existing technologies that can help in producing real-time models of the mouth, such as Electromagnetic Articulography and Electropalatography, are limited in their use (Kochetov 2020a; Rebernik et al 2021), a consequence of their resource-intensive and costly operation and their inability to encapsulate the movements of all articulators.
Although the problem statement centres around oral actions, the review is not limited to the mouth. It explores elements of the medical and computer science fields that have demonstrated the use of ideas, approaches, and methodologies capable of addressing the problem statement. A survey of literature surrounding mouth-specific movements and structures is vital to forming an understanding of what a complete 3D oral model would consist of. Logically, the next step is to understand how these structures relate to the actions the mouth performs, i.e. speaking, chewing/swallowing, breathing etc. The paper also takes an in-depth look at existing methods used to capture oral movements during action. These are a variety of 2D and 3D approaches that, to varying degrees, are currently used to help visualise oral movements.
Mapping the movements of the mouth is made difficult by its complex and deformable 3-dimensional structure, which comprises both external and internal elements. The complex movements of the mouth are the result of the interaction between multiple elements, including both soft tissues (such as the tongue and velum) and hard structures (such as the jaw and teeth). One approach to addressing this problem is to form predictive models that accept observations of (easily accessible) external movements and produce a predicted internal structure. Such techniques have been spearheaded within the medical field of radiology and will be explored with regard to their potential for application in oral modelling.
To fully comprehend the intricacies of oral actions, it is essential to closely examine the movements that occur within the vocal tract and their relationship to oral physiology.

Articulatory phonetics
Before we can explore the Computer Science and Clinical Phonetics fields that address the problem of oral modelling and speech analysis, it is important to understand the articulatory phonetics that govern the production of speech. This section defines relevant terminology pertaining to this field, and identifies the 'external' and 'internal' articulators.
The first pertinent definition is that of articulatory phonetics itself: a subfield of phonetics that focuses on the study of how speech sounds are produced in different languages by examining the movements and positions of the vocal organs, also known as articulators (Keating 2001).
The definition touches upon another closely linked process worth defining: articulation. Articulation is the means by which speech is formed, through the movements of vocal organs called articulators. In phonetics, articulation has been defined as the movement and/or positioning of the vocal organs (such as the tongue, lips, and jaw) during speech production. These movements and positions influence the shape and configuration of the vocal tract, which in turn affects the quality and characteristics of the resulting speech sounds (Ladefoged and Johnson 2015).
Speech recruits the use of various vocal organs, known as articulators. During this articulatory process some of these vocal organs may be externally visible. These include the upper and lower lips, as well as the teeth and, at certain times, the tongue when it is protruding through the lips and teeth; these are the 'external articulators'. However, most of the structures inside the oral cavity cannot be seen externally and thus form what are referred to as 'internal articulators'.
Throughout this paper, we will also refer to the vocal tract, defined as follows: 'vocal tract is the term used to refer to the entire speech apparatus, with the larynx as the central element which subdivides the apparatus into lower and upper regions' (Ball 2021).

The articulators
Before we dive into the complexities of speech production and oral movements, it's important to familiarise ourselves with the various articulators involved in the process. These include the lips, jaw, tongue, teeth, hard palate, velum, and larynx. The external articulators primarily consist of the jaws and lips, while the internal ones include the hard palate, velum, and larynx. It's worth noting that the role of the teeth and lips in articulation can vary depending on the specific sentence being spoken.

Speech production
Speech production (SP) is the process by which words are spoken. This may seem to be the same as the previously described 'articulation', but there is a difference. Speech production involves the physical creation of speech sounds, as well as hearing, perception, and information processing in the nervous system and brain. The process is complex and involves a feedback loop to ensure the speech produced is meaningful (Docio-Fernandez and García Mateo 2015).
In other words, SP is the complete process by which initial thoughts are translated to speech; articulation is just one part of SP. There are three main stages to the process: initiation, phonation, and articulation (there is also an additional fourth, coordination). Ball (2021) provides a detailed overview of the SP process, exploring these three stages in detail. A short summation of each stage is provided below, paying closer attention to systems that engage the vocal tract (rather than other systems, e.g. the lungs).

Initiation
As the name suggests, this initial stage marks the beginning of the speech process. Articulation was described earlier as a process that modifies an airstream to produce the sounds of speech; initiation is the method by which humans generate the air pushed upwards through the vocal tract. This method of air generation is known as the airstream mechanism and can be initiated from three points of the body: the lungs, the velum, and the glottis.
Airstream mechanisms beginning from the lungs are controlled by the pulmonic system and are thus called pulmonic airstreams. Contractions of the ribcage, controlled by the diaphragm, work to fill the lungs with air, which is then released through the vocal tract. Velaric airstreams are the redirected flow of air produced by the lungs into the oral or nasal cavity, a task completed by the velum. The raising and lowering of the velum dictates normal breathing or the production of nasal sounds. Glottalic airstream mechanisms control the movement of air by action of the glottis. The opening and closing of the glottis form an upward or downward movement of air, which then reaches the second point of articulation (further down the vocal tract). Most sounds produced on a glottalic airstream are ejectives, such that sounds are formed by air being pushed out through the mouth and nose (also referred to as an egressive airstream).

Phonation
Phonation is the second stage of vocal sound and speech production, a process by which the previously mentioned pulmonic egressive airstreams undergo pressure changes induced by the motion of two vocal folds situated in the larynx. Movements of the cartilage structure surrounding the larynx open and close a roughly triangular space between the vocal folds that allows the passage (or restriction) of air; this space is called the glottis, an opening crucial in forming vowels and consonants.
Consonant sounds produced within common speech are a result of two main vocal fold configurations. The first are 'voiced' consonants. These are created when the vocal folds are held together and vibrating, thus creating a narrower glottal aperture; an example of a word beginning with such a consonant would be 'broom'. The second, 'voiceless' consonants, are a result of a larger glottal opening, an example being the initial consonant of 'hat'.

Articulation
This is our final stage of interest. Expanding upon the definition provided previously, articulation refers to the shaping of the resultant airstream, generated and altered during the initiation and phonation stages; at this point the articulators are configured to form the desired labialisations.
Table 1 classifies the passive (rigid) or active (mobile) motion of articulators, and the IPA symbol of which voiced or voiceless consonants they create.
Examples of consonant types and sample voiced/voiceless fricatives are also provided.
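The kind of classification described above can be sketched as a small lookup structure. The following is only an illustrative sketch and does not reproduce Table 1: the four places of articulation, their active/passive articulators and their fricative pairs follow standard IPA conventions, while the `Place` class and the `places_using` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    """One place of articulation (illustrative, not the paper's Table 1)."""
    name: str
    active: str     # mobile articulator
    passive: str    # articulator that is approached or contacted
    voiceless: str  # IPA symbol of the voiceless fricative
    voiced: str     # IPA symbol of the voiced fricative


# Standard IPA fricative pairs for four common places of articulation.
PLACES = [
    Place("bilabial", "lower lip", "upper lip", "ɸ", "β"),
    Place("labiodental", "lower lip", "upper teeth", "f", "v"),
    Place("dental", "tongue tip", "upper teeth", "θ", "ð"),
    Place("alveolar", "tongue tip or blade", "alveolar ridge", "s", "z"),
]


def places_using(articulator):
    """Return the names of the places that involve the given articulator."""
    return [p.name for p in PLACES if articulator in (p.active, p.passive)]


print(places_using("lower lip"))    # ['bilabial', 'labiodental']
print(places_using("upper teeth"))  # ['labiodental', 'dental']
```

A structure of this kind makes it straightforward to ask which consonant classes are affected when a particular articulator cannot be observed by a given recording method.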
It is worth noting that in some speech, two simultaneous primary places of articulation can occur. This is called double articulation. For example, labial-velar consonants are doubly articulated and engage the use of both the velum and lips. Now that the articulatory process has been discussed, speech itself can be defined. Simply put, this can be explained as the use of vocal organs to generate speech. However, a more formal definition is: '…movements or movement plans that produce as their end result acoustic patterns that accord with the phonetic structure of a language' (Kent 2015).
Speech taxonomies, that is, classifications of the various speech behaviours, are generally a well-researched sphere. To provide an idea of what these constitute, some are listed below (Kent 2015):
• Emotional speech: Speech that expresses an emotion such as anger, sadness, happiness, or fear; sometimes contrasted with neutral speech
• Empty speech: Speech that is semantically void (e.g., comprising automatisms, vague circumlocutions, or single words)
• Exaggerated (overarticulated) speech: Speech produced with unusually large ranges of articulatory movement and/or force; similar to hyperspeech but with more deliberate and extensive movements
• Nonsensical speech (nonsense): Speech that does not convey meaning, usually because it involves phonetic sequences that do not conform to the words in a given language

Nonspeech oral movements
Both verbal and nonverbal actions are governed by the craniofacial and masticatory musculatures of the face; more specifically, these include movements pertaining to speech, facial expressions, biting, chewing, ventilation, and swallowing (Kent 2015). This section reviews the nonspeech elements of oral action.
Kent (2015) reviews a vast array of literature to collate definitions and propose taxonomies for both speech and non-speech oral movements (NSOMs). Although definitions and taxonomies for the oral process can vary, the paper provides clear descriptions of the movements themselves; thus, this evaluation of NSOMs refers back to Kent (2015) often. The narrative review defines NSOMs as: 'Motor acts performed by various parts of the speech musculature to accomplish specified movement or postural goals that are not sufficient in themselves to have phonetic identity'. In essence, NSOMs cover a vast range of orofacial movements that are performed alone or with other movements for varying purposes; governing these movements are the articulators and facial muscles. Alongside speech, facial muscles serve two main nonspeech functions, chewing and facial expressions (Westbrook et al 2022). The following sections will explore these movements.

Mastication
The chewing process, also referred to as mastication, is a motor activity intended for processing food in preparation for swallowing. The complex process involves the action of the suprahyoidal muscles, craniofacial musculature, vocal organs, and even saliva (van der Bilt et al 2006). The process is complex in the sense that the movements for mastication are formed by multiple interacting parts. Although the chewing process can be explored in much detail, this review is primarily interested in how processes engage the articulators and will thus primarily focus on such literature and taxonomy pertaining to the vocal organs.
As mentioned, mastication aims to break down and crush food to be mixed with saliva and moved to the back of the throat for deglutition (swallowing). The 'muscles of mastication' consist of the muscle groups: temporalis, masseter, medial pterygoid, and lateral pterygoid.
However, it is key to note that the chewing process involves more than just the 'muscles of mastication'. Neurological control of the jaw and other muscles, individual anatomy and even the types of food being processed govern the cycle of mastication adopted, with certain foods having a longer/shorter cycle (Soboļeva et al 2005).

Facial expressions and other NSOMs
Alongside mastication, the process of facial expression generation is one of the main NSOMs surrounding 'all things oral action' that we are trying to unfold. Certain facial expressions adopt the use of identified articulators or oral motor systems: these include facial expressions such as smiling and surprise, as well as lip pursing, jaw opening, and tongue protrusion (Kent 2015). However, others draw on the use of nonidentified systems: these include actions such as coughing, laughing, and blowing.
At times a facial expression may be a consequence of another movement. Coughing, for example, engages muscle systems including the respiratory system (among others), during which process the distinct 'coughing facial expression' is produced. Kent (2015) provides a table of proposed speechlike and non-speech movements, categorised into the muscle systems they employ and their general function. Below are identified some of these movements, the full classification of which can be viewed within their paper:
• Oral only: Licking, Sucking, Smiling
• Respiratory: Subglottal air pressure control, Prolonged expiration
• Respiratory and laryngeal: Grunting, Moaning, Crying
• Oral and respiratory: Panting, Blowing, Sighing, Whistling
• Oral, laryngeal and respiratory: Coughing, Laughing
Additionally, certain NSOMs produce an audible output as a result of the action, e.g. coughing, panting, moaning, laughing.
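For readers who wish to work with this classification programmatically, the grouping listed above can be held in a simple mapping from movement to the set of muscle systems it engages. This is a minimal sketch only: the entries are those quoted above from Kent (2015), while the variable and function names are hypothetical and the full classification in that paper is larger.

```python
# Movement -> muscle systems engaged, using only the examples quoted above.
NSOM_SYSTEMS = {
    "licking": {"oral"},
    "sucking": {"oral"},
    "smiling": {"oral"},
    "subglottal air pressure control": {"respiratory"},
    "prolonged expiration": {"respiratory"},
    "grunting": {"respiratory", "laryngeal"},
    "moaning": {"respiratory", "laryngeal"},
    "crying": {"respiratory", "laryngeal"},
    "panting": {"oral", "respiratory"},
    "blowing": {"oral", "respiratory"},
    "sighing": {"oral", "respiratory"},
    "whistling": {"oral", "respiratory"},
    "coughing": {"oral", "laryngeal", "respiratory"},
    "laughing": {"oral", "laryngeal", "respiratory"},
}


def movements_engaging(system):
    """Movements that engage the given system, possibly alongside others."""
    return sorted(m for m, s in NSOM_SYSTEMS.items() if system in s)


def movements_with_exactly(systems):
    """Movements whose set of engaged systems equals the given set."""
    return sorted(m for m, s in NSOM_SYSTEMS.items() if s == set(systems))


print(movements_with_exactly({"oral", "laryngeal", "respiratory"}))
print(movements_engaging("laryngeal"))
```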

Capturing of articulatory actions
Having now identified the various articulators that form the speech process, it is just as important to understand how these structures and motions can be recorded. Recording and quantifying the movements of the articulators is a difficult task. Depending on the needs of the experiment/research, any specific methodology can be desirable. Various technologies capable of capturing the movements of the vocal tract are currently in use, each one addressing the five vocal organs to a varying degree.
Table 2 provides a modified extract from Kochetov (2020a). It lists the 12 methods considered here and then indicates the individual capabilities of each of these systems. The final column of the table, titled 'MRI scan highlighting the articulators recorded', shows the relevant articulators highlighted in different colours. Note that the method may not necessarily engage the use of MRI; rather, the image aims only to display the relevant organs to the reader. The methodologies chosen are often subject to financial constraints, as well as access to machinery and trained staff. Furthermore, certain research methodologies require the collection of both auditory and articulatory data. In the case of Electromagnetic Articulography (EMA), this becomes problematic since the acoustics are changed when sensors are attached to the tongue and around the mouth (Meenakshi et al 2014).
Kochetov (2020a) reviewed 379 full research articles published between 2000 and 2019, to find out which methods of articulatory recording have been used most during this 20-year period. The survey included papers published in the field of Language and Speech, Phonetics, Phonology and more. The results showed that around 60% of these experiments used Electropalatography (EPG), Ultrasound, and EMA techniques for articulatory recording. The other 40% was occupied by the remaining techniques, with MRI notably only taking up 6% of used methods, despite being a technique that can quantify the movements of all the articulators across the whole vocal tract.
Electromagnetic Articulography is the most used; however, it is also important to consider that techniques are often used in conjunction with one another.
This paper provides a summation of the various techniques mentioned. For each of the recording methods, a brief overview of the approach can be found below, along with examples of use, overall safety of the technique, sampling rate, audio compatibility, cost, availability of data sets and a recent review, that is, one that took place after Kochetov's review (2021 or later). The section aims to offer the reader an overview of the current use of various techniques and also provide a base from which a researcher can select a method suited to their needs.
In some instances, including cases where the technique has not been in use for quite some time, a recent review, cost, or example of a dataset has not been found. It is worth noting that factors such as cost and safety overview are presented for a casual comparison, but a more in-depth, up-to-date investigation would be required by the researcher before adopting a technique.

Electropalatography (EPG)
Electropalatography is a technique introduced in 1970, used to identify the tongue and hard palate location during articulation; the technique's ability to record dynamic speech features further allows for the detection of sound production (Mat Zin et al 2021). During the process, a custom-built artificial plate is placed within the speaker's mouth, and subsequently clipped on to the individual's upper palate. The artificial palate is lined with a grid of electrodes, capable of registering the contact taking place between the tongue and the roof of the mouth (Verhoeven et al 2019). Detailed in table 3, the technique allows for quantifying where and how the tongue touches the roof of the mouth during speech. EPG can also be used to analyse contact patterns during real-time speech generation (Hardcastle et al 1989). With every consonant uttered, a unique contact pattern is produced on the hard palate. This can be used to identify the sound produced during speech, with the location of the tongue and hard palate being detected by electrode sensors present on the artificial palate.
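To make the idea of a contact pattern concrete, a single EPG frame can be thought of as a binary grid in which each cell records whether an electrode registered tongue contact. The sketch below is purely illustrative: the 4x4 grid, the two named templates and all function names are assumptions made for brevity and do not correspond to any real artificial palate layout or published analysis method.

```python
# Hypothetical 4x4 electrode grid; row 0 is nearest the front teeth.
# 1 = tongue contact registered, 0 = no contact.

def contact_ratio(frame):
    """Fraction of electrodes in contact with the tongue in one frame."""
    electrodes = sum(len(row) for row in frame)
    return sum(sum(row) for row in frame) / electrodes


def row_totals(frame):
    """Number of contacted electrodes per row, front to back."""
    return [sum(row) for row in frame]


def hamming(a, b):
    """Count the electrodes whose state differs between two frames."""
    return sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def closest_template(frame, templates):
    """Name of the stored contact pattern most similar to the frame."""
    return min(templates, key=lambda name: hamming(frame, templates[name]))


# Illustrative templates only; real patterns are speaker and palate specific.
TEMPLATES = {
    "front closure": [[1, 1, 1, 1], [1, 0, 0, 1], [1, 0, 0, 1], [0, 0, 0, 0]],
    "back closure": [[0, 0, 0, 0], [1, 0, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1]],
}

observed = [[1, 1, 1, 1], [1, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 0]]
print(contact_ratio(observed))                # 0.4375
print(row_totals(observed))                   # [4, 2, 1, 0]
print(closest_template(observed, TEMPLATES))  # front closure
```

Note that the grid holds no information for frames in which the tongue does not touch the palate; every cell is simply zero.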

Advantages
This technique is suitable for children and individuals with disabilities who find it difficult to remain still.

Disadvantages
The retainer-like contraption placed against the hard palate means the technology is unsuitable for individuals who already use dental prosthetics. Data is limited to the oral gestures of the tongue. The technique provides no information about the location of the tongue when it is not in contact with the hard palate. The method is also invasive, as the plate has to be placed in the mouth.

Examples of use
Wood (2010) used EPG on individuals with Down Syndrome and found that they can continue to improve their speech production and intelligibility as they progress from adolescence to adulthood.

Overall safety
The material used in developing EPG is nontoxic (Mat Zin et al 2021). The artificial palates are made from acrylic resin, silver electrodes and copper wire; these materials are FDA approved and widely used in dental applications (such as dentures and retainers and EMG).

Sampling Rate
The linguopalatal contact is tracked dynamically, typically taking samples every 10 milliseconds (Kochetov 2020a).

Audio Compatibility
Yes.

Recent Review
Mat Zin et al (2021), 'The technology of tongue and hard palate contact detection: a review'.

Examples of Available datasets
EPG data from two female speakers of Central Arrernte. Both subjects were recorded uttering the same words using two different sorts of palates (Tabain 2011).

Ultrasound
Ultrasound is an imaging technique introduced in the 1960s (Kelsey et al 1969). It has since become the second most popular method of articulator recording over the past 20 years, as shown in table 4 (Kochetov 2020a). The technique uses a transducer probe, capable of emitting a high-frequency sound wave. When held against the neck, the thin beam projected from the probe travels through the tongue tissue and is reflected back to the transducer, forming a 2D image of the tongue. Ultrasound's inability to image bone or air means it does not allow for the visualisation of the palate, jaw, or rear pharyngeal wall, making it suitable only for imaging the tongue in speech research applications (Bliss et al 2018, Kochetov 2020a). Although most ultrasound machines are stationary devices situated in hospitals, mobile USB probes are now being used more often, making the recording process more convenient and accessible.

Electromagnetic articulography (EMA)
Electromagnetic Articulography is a point tracking technique (Mennen et al 2010), during which a series of sensors placed on target articulators record real-time movements in 3D (table 5). Later developments in the approach have led to its capability of taking five-dimensional recordings, collecting three Cartesian coordinates and two angular coordinates (Hoole and Zierdt 2010), therefore capturing information in

Advantages
With the availability of smaller portable ultrasound systems, this technique is affordable and accessible to researchers. In the past, a low sampling rate meant that short articulations were not captured, or were captured in poor quality. This problem has since been reduced with the introduction of higher frame rate devices.

Disadvantages
The motion of the tip of the tongue is missed when the tongue is raised or extended forward (Cleland et al 2011).

Static palatography
Static palatography is a technique developed in the 19th century, used to study constant articulation (Kochetov 2020a). During this process, the tongue is painted black using an edible paint-like material to record the contact it makes with the roof of the mouth during articulation (Anderson 2008). A mirror is inserted into the subject's mouth and a photo or video is taken, to show the location of paint traces on the hard palate (post articulation). As indicated in table 6, the simplicity of the technique makes it perhaps one of the most accessible methods of articulation visualisation.

Video and optical tracking
Video and optical tracking is a simple, non-invasive method to record the movements of a patient's lips, jaw, and to some extent the tongue. When coupled with the use of a mirror, the method also allows us to see a side-view profile of the individual's mouth during speech. Displacement of the visible articulators is used to understand lip configurations. Additionally, the technique can be used in conjunction with other articulatory recording systems, such as ultrasound and EMA. Further details are highlighted in table 7.

X-ray microbeam
X-ray microbeam (XRMB) is a computer-controlled point tracking system that uses a narrow (0.4 mm in diameter) x-ray beam to locate and track the movements of gold pellets attached to the target organ (in this case, target articulator); these include the lips, jaw, and tongue (Barlow and Stumm 2009). Serving as a reference point, two additional gold pellets are attached to the bridge of the nose. As presented in table 8, the scanned images produce a shadow, detected on a sodium iodide crystal detector, which is then transmitted to a computer that allows us to study the movements of the articulators.

(compared to the former). It works by shining an external light down the oesophagus, and using external sensors placed on the skin surface below the glottis. These sensors detect changes in light intensity, and therefore provide an indirect image of glottal width.

Oral airflow/pressure
Techniques in this section concern the study of oral airflow and air pressure, called pneumotachography.
The technique can record the movements of the lips, tongue, velum and the larynx. Data regarding oral air flow is collected using a mask that is placed around the patient's mouth; the individual speaks into the mask while holding it against their mouth, a process that can be uncomfortable (Hirshkowitz and Kryger 2017). Alongside audio, the system records speech air flow, measured by the volume of air that leaves the mouth within a certain period. Intraoral air pressure can be monitored by using a small tube attached to the mask that is inserted into the patient's mouth (Kochetov 2020b). Further details are highlighted in table 11.
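The airflow measure described above, a volume of air leaving the mouth within a certain period, amounts to a simple rate calculation. The sketch below illustrates it under stated assumptions: the units (millilitres and seconds), the sample values and the function names are hypothetical, and the volume is approximated from evenly spaced flow samples with a basic rectangle rule rather than any device-specific procedure.

```python
def mean_flow_rate(volume_ml, duration_s):
    """Mean oral airflow in ml/s: volume of air leaving the mouth per second."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return volume_ml / duration_s


def volume_from_samples(flow_ml_per_s, interval_s):
    """Approximate expelled volume from evenly spaced flow samples.

    Uses a rectangle rule: each sample is assumed to hold for one interval.
    """
    return sum(flow_ml_per_s) * interval_s


# Hypothetical flow samples (ml/s) taken every 10 ms during a short utterance.
samples = [0.0, 100.0, 200.0, 100.0]
interval = 0.01

volume = volume_from_samples(samples, interval)
rate = mean_flow_rate(volume, len(samples) * interval)
print(volume, rate)
```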

Nasal airflow
Nasalance measurement follows on from the previous method and allows us to measure nasal air flow.
The technique uses two microphones positioned between the nose and the upper lip to measure the amplitude related to the air released by the nasal tract and the air emitted by the oral tract. The nasal air flow provides a rough measure of velum height. The device used for the recordings (such as the Nasometer II 6450) is held up to the mouth by the patient (Kochetov 2020b). Further details are highlighted in table 12.

et al 2023), and the associated protocols, even when the dataset is not immediately available, serve as valuable guidelines for others to gather high-quality data (Lim et al 2023, Wu et al 2023). Notably, there is an uptick in the utilisation of machine learning techniques in tasks related to vocal tract MRI (Ribeiro et al 2022, Laprie et al 2023, Ruthven et al 2023). Additionally, a toolkit for assessing vocal tract shape has been developed (Belyk et al 2023).
As the interest in Real-time Magnetic Resonance Imaging (rtMRI) continues to surge, researchers are continually driven to seek novel and innovative solutions for advancing speech analysis research.
Although recent evidence highlights the promising potential of rtMRI, it is not immune to significant limitations that are shared by the majority of techniques used to study the vocal tract. These techniques are usually too expensive or invasive, and quite often both; those that bypass these constraints are limited to only one or two articulators. They are not practical for consumer purposes, and companies often cannot spend the amount of money that is required to perform data collection with systems such as MRI and EMA. Furthermore, individuals are not interested in partaking in a time-consuming, and at times invasive, process. As a result, collection of primary data is limited. This is a problem previously identified by a University of Southern California study group specialising in speech production and articulation (SPAN). They are working on bridging this gap by creating open-source MRI datasets aimed at fuelling the development of applications and ideas inspired by AI and machine learning methods (Lim et al 2021).
In order to aid a wider research community, low-cost, non-invasive and time-efficient systems and methodologies are required. One such approach can be inspired by the use of internal-external correlation modelling. There is very little research in applying this technique to create solutions for the oral cavity, but we have seen similar and relevant approaches applied to other parts of the body. Below we address and review these works that link the external and internal. They are relatively non-invasive techniques that predict internal structures based on external observations.

Internal-external correlation modelling
Internal-external correlation models are a method to estimate the motion and presence/location of an internal object, based on its external view. Although the use of such an approach has not yet been fully explored in oral modelling, it can be found in other organ modelling systems.
Chen et al (2018) explore the development of a local topology preserved non-rigid point matching algorithm, used in creating an internal-external correlation model for internal action mapping with applications in lung cancer radiotherapy treatment. Organs and tumours in the thoracic region go through significant respiration-induced motion: translation, rotation and deformation. This motion can be utilised to accurately track both tumours and surrounding organs at risk. This is done by registering the vector fields, which describe the motion between internal and external components. They are acquired by individually aligning the meshes of internal organs and external surfaces from the images via the developed algorithm.
Several other studies have also demonstrated the feasibility of finding correlations between internal and external motions, detected by respiratory surrogates. Fayad et al (2011) aimed at assessing motion correlation between a patient's external surface and internal anatomical landmarks. They concluded that it is possible to reduce variability and associated errors in the respiratory motion synchronisation and motion modelling process by capturing in real time the motion of the complete external patient surface, as well as by choosing the area of the surface that correlates best with the internal motion. Martin et al (2013) presented a novel method to build a surrogate-driven motion model of a tumour using a Dental Cone Beam scan, without the need for markers. The method was shown to extract tumour motion from a variety of lung cancer patients, with tumours present in different locations within the cavity. By tracking the movement of an external reference point in real time, doctors can use this model to guide treatments that are synchronised with the tumour's motion. The model is created just before each treatment session to account for any changes in the tumour's position. This method also helps doctors better understand the shape and movement of the tumour before delivering precise radiotherapy treatments. The method involves two steps. First, the tumour area is highlighted in the CT scan images. Then, the model is created based on the movement of an external reference point. In tests using simulated data, the average difference between the estimated and actual tumour positions was reduced to just 1 millimetre. When applied to real patient data, the average difference between the estimated and clinically identified tumour positions was less than 2.5 millimetres in both up-down and sideways directions.
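The general principle behind a surrogate-driven model, predicting an internal position from an externally observed signal, can be illustrated with a deliberately simple example. The sketch below fits a straight line by ordinary least squares to paired external and internal displacements and then predicts the internal position for a new external reading. It is not the method of Martin et al (2013), Fayad et al (2011) or Chen et al (2018); the linear form, the data values and all names are assumptions chosen for illustration.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, surrogate_value):
    """Estimate the internal position from one external surrogate reading."""
    slope, intercept = model
    return slope * surrogate_value + intercept


# Hypothetical training pairs: external surface displacement (mm) and the
# internal target displacement (mm) observed at the same instants.
external = [0.0, 1.0, 2.0, 3.0, 4.0]
internal = [1.0, 3.0, 5.0, 7.0, 9.0]

model = fit_line(external, internal)
print(model)                # slope and intercept of the fitted line
print(predict(model, 2.5))  # estimated internal displacement for a new reading
```

In the oral setting the same pattern would apply with, for example, an externally tracked lip or jaw landmark as the surrogate and an internal articulator position as the target, although a linear relationship is unlikely to be adequate for deformable structures such as the tongue.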

Modelling the Interrelationship between the face and vocal tract
Applying this approach to our problem requires us to first define the internal and external components. The external component is that of the face, and the internal is the vocal tract. The face during articulation (or indeed mastication) can be captured using an RGB recording camera. These videos or pictures can be captured from several views, including the front-on and side-on views. Figure 1 shows a still frame from a video where the participant shown utters the phrase 'Miss black thought about the lap'. It includes the simultaneous capturing of the coronal (frontal) and sagittal (longitudinal) planes.
The internal view can be represented with any of the articulatory recording techniques mentioned in section 4. Depending on a researcher's specific requirements they could choose any of the 12 methods. Ideally the chosen technique would be one that captures as many articulators as possible, to maximise the learned traits from the two modalities. Out of all the currently viable techniques, real-time MRI (rtMRI) is the single option that can record all the articulators as well as provide a view of the entire vocal tract. For this reason, rtMRI stands out as one of the most suitable techniques to represent the internal view (figure 2). Continuing with the two modalities mentioned, the task here would then be to find the correlation between the representation depicted in an RGB external camera view and the MRI internal vocal tract view. To simplify the explanation, the problem can be expressed through two interrelationships: D, which relates the internal input M (rtMRI) to the external input I (camera) for a single frame pair, and D′, which relates them over time across a whole oral action Y. When determining D or D′ in various combinations of internal or external modes of representation, certain factors should be taken into account. For each specific articulation, there exists an absolute ground truth for both the internal and external views, which corresponds to the real-time movement of the articulators. However, the choice of modality used to represent either of the two views is limited by the constraints of the specific technique employed. For instance, in the case of MRI, factors such as pixel resolution, frame rate, or the fidelity of MRI signal deconvolution can impact the representation and potentially alter the ground truth view, depending on the MRI machine being used. Any interrelationship drawn between the two views must therefore acknowledge that each captured modality is an interpretation of the absolute ground truth, subject to the constraints of the technique used to capture it.
As far as our findings indicate, Scholes and Skipper (2020) is the only work aimed at investigating the link between facial and vocal tract movements during speech production. They formed a unique dataset that consists of paired, temporally aligned videos of both the face (captured at front-on, side and 45-degree angles) and the sagittal MRI view during 10 different utterances. Using this aligned cross-modal dataset they applied principal component analysis (PCA) to demonstrate that the MR image sequences can be reconstructed with high fidelity using videos of only the external face. The PCA worked by capturing dynamic regions of the vocal tract, such as the tongue and lips, while ignoring static areas with little movement through an utterance (such as the brain and spinal cord). MR sequences could then be reconstructed by projecting the video input data into the MR PCA space generated; the opposite was done for the generation of the external view from the internal view. The resultant reconstructed MR sequences (from video input) were very similar to the original sequences. Their work revealed that there is sufficient information in the face to recover vocal tract shape during speech, for set utterances, and likewise to reconstruct sequences from either of the two imaging domains. However, it does not bypass the need for acquiring the MRI sequence itself. To reconstruct either of the sequences, the PCA space of the opposing modality is still required; so, to create a reconstructed MRI sequence, the original MRI is still needed. A solution to our problem statement must be able to produce the MRI sequence of an external video without having its specific corresponding PCA space. In other words, it must be able to produce an internal representation of the vocal tract from an external view alone, without needing the matching MRI acquisition. This generative approach would be a result of the learned correlation between the face and vocal tract during articulation.
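The general principle of cross-modal reconstruction through a shared PCA space can be sketched as follows. This is an illustrative assumption of how such a projection might be set up (a joint PCA over concatenated video and MR frame vectors, with scores estimated from the video part alone), and not necessarily the exact procedure of Scholes and Skipper (2020). All function names and the synthetic low-rank data are hypothetical.

```python
import numpy as np

def fit_joint_pca(video, mr, n_components):
    """PCA over paired frames; each row is [flattened video frame, flattened MR frame]."""
    joint = np.hstack([video, mr])
    mean = joint.mean(axis=0)
    _, _, vt = np.linalg.svd(joint - mean, full_matrices=False)
    return mean, vt[:n_components]

def reconstruct_mr_from_video(video, mean, components):
    """Estimate component scores from the video part only, then decode the MR part."""
    n_video = video.shape[1]
    mean_v, mean_m = mean[:n_video], mean[n_video:]
    comp_v, comp_m = components[:, :n_video], components[:, n_video:]
    scores, *_ = np.linalg.lstsq(comp_v.T, (video - mean_v).T, rcond=None)
    return scores.T @ comp_m + mean_m

# Synthetic paired data (hypothetical): both modalities are driven by 3 shared latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(60, 3))
video = latent @ rng.normal(size=(3, 20))
mr = latent @ rng.normal(size=(3, 30))

# Fit on 50 paired frames, then reconstruct MR for 10 frames given video only.
mean, components = fit_joint_pca(video[:50], mr[:50], n_components=3)
mr_hat = reconstruct_mr_from_video(video[50:], mean, components)
reconstruction_error = float(np.max(np.abs(mr_hat - mr[50:])))
```

Note that the PCA space in this sketch is still built from paired MR data, which mirrors the limitation discussed above: the mapping is only as general as the paired recordings it was fitted on.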
A solution to address this problem could potentially be found in computer vision and deep learning approaches, a field centred around learning the characteristics of a dataset to then predict and interpret visual information. By leveraging the power of neural networks and advanced algorithms, computer vision and deep learning can analyse images or videos, extract meaningful features, and make novel predictions based on the learned patterns.

Deep learning approaches for image synthesis
In the previous section we briefly discussed the main drawback of traditional statistical correlation modelling approaches when it comes to image synthesis: the need for both imaging domains at inference. Deep learning, a subset of machine learning, is well placed to provide a solution to this computer vision problem.
Computer vision is a field of AI aimed at enabling computers to derive meaningful information from visual stimuli. These tasks can range from low-level edge detection to high-level tasks such as complete scene understanding. Over the last decade, impressive developments in computer vision have come because of advancements in deep learning.
Deep learning is a machine learning method that trains artificial neural networks to learn patterns and make predictions. With the growing availability of large-scale datasets and the ever-increasing processing power of computers, researchers have been able to develop highly effective pattern recognition models for use in many fields, including medical imaging (Esteva et al 2021).
The backbone of deep learning in computer vision is the Convolutional Neural Network (CNN). These are specifically designed to analyse visual data by mimicking the human visual system. CNNs consist of multiple layers, including convolutional layers, pooling layers, and fully connected layers. Convolutional layers perform feature extraction by applying a set of filters or kernels to the input image. Each filter captures different visual patterns, such as edges, textures, or shapes, and convolving this with the input image produces a feature map. The use of shared weights and local receptive fields in convolutional layers enables the network to learn hierarchical representations of the input data. Pooling layers reduce the spatial dimensions of the feature maps by downsampling them. This helps in creating more robust features by discarding irrelevant spatial information and retaining important features. Common pooling techniques include max pooling and average pooling. Fully connected layers are responsible for making predictions based on the features extracted by the previous layers. These layers connect every neuron from the previous layer to every neuron in the current layer, enabling the network to learn complex relationships between features and make accurate predictions.
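The two core operations described above, convolution and pooling, can be illustrated with a minimal NumPy sketch. The kernel here is hand-picked (a horizontal difference filter) purely for illustration; in a trained CNN the kernel weights are learned. The function names are hypothetical and the loops favour readability over speed.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (cross-correlation, as in most CNN libraries)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows or columns that do not fill a window are dropped."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    trimmed = feature_map[:h, :w]
    return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))

# A 6x6 image with a vertical edge: dark on the left, bright on the right.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

edge_kernel = np.array([[-1.0, 1.0]])            # responds to left-to-right intensity increases
feature_map = conv2d_valid(image, edge_kernel)   # shape (6, 5), non-zero only at the edge column
pooled = max_pool2d(feature_map, size=2)         # shape (3, 2), edge response retained
```

The feature map is non-zero only where the intensity changes, and pooling halves the spatial resolution while keeping the edge response.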
The problem we wish to address is how a dataset of paired MRI and external view images can be used to synthesise MRI views from external views of the face, using patterns recognised between the two modalities. This falls into the field of image-to-image translation, the process by which an image from one modality is transformed into another, with the aim of learning the relationship between the input and output image. This is a deep learning task, often addressed using generative adversarial networks. Such models are best trained with datasets of paired and aligned images. The concept of image-to-image translation allows for various applications, such as converting images from one style to another (e.g. grayscale to colour), transforming images across different domains (e.g. day to night), or synthesising realistic images from rough sketches. Successful image-to-image translation relies on having a well-prepared and representative dataset for training, as well as careful selection and design of the deep learning architecture and loss functions. Additionally, understanding the limitations of the approach, such as potential artefacts or biases in the synthesised images, is crucial for ensuring the reliability and accuracy of the generated data.
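As an illustration of how loss functions are typically combined in paired image-to-image translation, the sketch below assumes one common formulation: an adversarial term plus a weighted pixel-wise L1 term between the generated and target image. This is an assumption for illustration rather than a method proposed in this review; the function names, the weight of 100 and the stand-in discriminator outputs are hypothetical, and no networks are actually trained here.

```python
import numpy as np

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)))

def generator_loss(disc_on_fake, fake, real, l1_weight=100.0):
    """Adversarial term (fool the discriminator) plus weighted L1 distance to the paired target."""
    adversarial = binary_cross_entropy(disc_on_fake, np.ones_like(disc_on_fake))
    l1 = float(np.mean(np.abs(fake - real)))
    return adversarial + l1_weight * l1

def discriminator_loss(disc_on_real, disc_on_fake):
    """The discriminator should output 1 for real images and 0 for generated ones."""
    real_term = binary_cross_entropy(disc_on_real, np.ones_like(disc_on_real))
    fake_term = binary_cross_entropy(disc_on_fake, np.zeros_like(disc_on_fake))
    return real_term + fake_term

# Stand-in values (hypothetical): an undecided discriminator and two candidate generated frames.
rng = np.random.default_rng(0)
real_mr = rng.random((8, 8))
undecided = np.full((4, 4), 0.5)

loss_perfect = generator_loss(undecided, real_mr.copy(), real_mr)
loss_offset = generator_loss(undecided, real_mr + 0.5, real_mr)
loss_disc = discriminator_loss(undecided, undecided)
```

The L1 term is what makes use of the paired, aligned dataset: a generated frame that drifts from its paired target is penalised even if the discriminator cannot tell it apart from a real one.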
A recent survey of cross-modality synthesis by Xie et al (2022) comprehensively approaches this complex task from different perspectives, including the level of supervision, loss function, range of modality and downstream tasks. The downstream task in this case would be the use of the generated MRI image.

Semantic segmentation of the articulators
Exploring how to extract meaningful information regarding articulatory movements from both original and generated MR sequences is a vital step towards further realising how rtMRI can be used in articulatory research. Knowing the relative positions of each of the vocal organs in a given frame will allow for clearer image understanding, thus improving the fidelity of any generated outcomes. For this process, researchers employ segmentation techniques to analyse vocal tract MRIs, enabling a comprehensive evaluation of the vocal tract's structure and function during speech.
Image segmentation (or more specifically medical image segmentation) is the process of identifying meaningful regions and structures within a medical image, through which a desired object (here, a vocal organ) is extracted from a 2D or 3D image (Li et al 2021). The medical image can be acquired through systems such as CT, MRI, X-ray and more.
The process in our use case involves precisely delineating the various anatomical components, including the tongue, lips, jaw, and velum, within the vocal tract. Accurate segmentation facilitates the extraction of quantitative measurements and geometric data about the vocal tract's regions. These measurements offer insights into speech production biomechanics, aiding in the understanding of speech disorders, language development, and treatment efficacy.
Segmentation approaches include manual delineation, semi-automated algorithms, and deep learning-based methods. Manual delineation involves experts manually tracing boundaries, ensuring precise results but requiring significant time and effort. Semi-automated algorithms assist by providing initial segmentations that can be refined manually. Deep learning techniques, employing convolutional neural networks, automatically recognise and segment vocal tract structures, reducing time and effort.
In the medical field, segmentation is often used for planning and guiding operations as well as measuring the outcome of therapeutic procedures (Kapur et al 2014). During image segmentation, the various sections of the target image are delineated and given labels.
To put this into perspective, we can observe an example of an annotated (delineated) still image by Ruthven et al (2021). Figure 3 shows an MRI view of the vocal tract to the left, alongside the same image annotated with each colour representing a different articulator.
Several segmentation technologies are available that can perform image segmentation utilising machine learning and deep neural networks. These approaches require ground truth data to train models to accurately segment new, unseen images. However, the delineation process is widely acknowledged as highly complex and, as a result, a significant portion of the annotation process continues to be performed manually (Wallner et al 2019).
Segmentation techniques (including those used to create ground truth data) can be broadly divided into two categories: intensity-based segmentation and shape-based segmentation. Each of these two categories has various semi- and fully-automatic segmentation algorithms. The following sections present examples of these and discuss some of the algorithms in use, in relation to the techniques adopted in the segmentation of rtMRI images of the face (as in figure 3).

Intensity-based segmentation
Intensity-based segmentation (IBS) relies on the principle that voxels within the target object, such as an organ, possess a distinct grey value (intensity) different from their surrounding structures. Even if this disparity is subtle and imperceptible to the human eye, models can effectively discern these differences. However, medical images often exhibit a wide range of grey scale distribution within the target object itself, which poses challenges in accurately distinguishing voxel intensities. IBS models encompass various techniques, including thresholding, clustering, deep learning, watershed, and graph-cut, each with its own advantages and applications.
Thresholding-based segmentation is particularly effective when applied to images in which the target has high voxel contrast compared to its surroundings. This technique is well suited to imaging bony structures and their surrounding tissues in CT scans, where there is a significant contrast in voxel intensities. Clustering is an unsupervised learning method that groups voxels within an image based on their similarities, without the need for ground truth data. It can identify clusters of voxels with similar characteristics (e.g. intensity), aiding in the segmentation process. Region-growing-based segmentation is an iterative process initiated by manually selecting a single seed point within the target object. From this seed, the region grows and expands until it encompasses the entire target object. The underlying assumption is that voxels within the same object are similar, allowing the algorithm to determine when to stop the region's expansion.
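The difference between thresholding and region growing can be seen in a small sketch. It assumes a 2D image, 4-connected neighbours and a fixed intensity tolerance relative to the seed value; these choices, along with the function names and the toy image, are illustrative assumptions rather than a specific published algorithm.

```python
from collections import deque

import numpy as np

def threshold_segment(image, level):
    """Label every pixel at or above the intensity level as foreground."""
    return image >= level

def region_grow(image, seed, tolerance):
    """Grow a region from the seed, adding 4-connected neighbours of similar intensity."""
    seed_value = float(image[seed])
    mask = np.zeros(image.shape, dtype=bool)
    mask[seed] = True
    queue = deque([seed])
    while queue:
        row, col = queue.popleft()
        for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            r, c = row + d_row, col + d_col
            inside = 0 <= r < image.shape[0] and 0 <= c < image.shape[1]
            if inside and not mask[r, c] and abs(float(image[r, c]) - seed_value) <= tolerance:
                mask[r, c] = True
                queue.append((r, c))
    return mask

# Toy image: a bright 3x3 object plus one isolated bright pixel of the same intensity.
image = np.zeros((5, 5))
image[1:4, 1:4] = 10.0
image[0, 4] = 10.0

by_threshold = threshold_segment(image, level=5.0)          # keeps object and isolated pixel
by_region = region_grow(image, seed=(2, 2), tolerance=1.0)  # keeps only the connected object
```

Thresholding selects every pixel of matching intensity regardless of location, whereas region growing only keeps pixels connected to the seed, which is why the isolated bright pixel is excluded in the second result.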
In summary, intensity-based segmentation techniques offer various approaches for segmenting medical images. By leveraging concepts like thresholding, clustering, deep learning, and region-growing, these techniques enable the identification and differentiation of voxels belonging to the target object, despite the challenges posed by the wide range of voxel intensities within the image.

Shape-based segmentation
In shape-based segmentation (SBS), the outline of the target object is roughly known in advance, such that segmentation can be completed by identifying a particular shape. In the case of the mouth this could be likened to, for example, identifying the positioning of the upper lip. Such methods explicitly use prior knowledge of a target shape, which is learned from a group of pre-annotated shape templates. These techniques include statistical shape models, statistical appearance models, and atlas-based segmentation. Because the model is limited to the shape variations present in these pre-annotated (template) images, variations in the target images that the templates do not cover may be missed; poorly annotated templates can likewise reduce segmentation quality.
Statistical shape models (SSM) work by mathematically describing the geometric shape of the target object. The variations in the target object are learned through shape templates (annotated images). This three-stage process involves the construction of shape templates, SSM creation from the shape templates, and adaptation of the SSM to a new image.
A good SSM will use a large set of shape templates to allow the model to learn the shape variations that occur. These shape templates are formed by manually annotating medical images. Once the model has been trained it should ideally be able to identify the target shape when segmenting, thus allowing it to be applied to new images.
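The three stages can be sketched with a simple point-distribution model, assuming that each template is a set of corresponding, pre-aligned 2D landmarks and that shape variation is captured by PCA. This is one common way of building an SSM and is offered as an illustrative assumption; the function names, the synthetic elliptical templates and the limit of three standard deviations are hypothetical choices.

```python
import numpy as np

def build_ssm(shapes, n_modes):
    """Learn the mean shape and main modes of variation from aligned landmark templates."""
    flat = shapes.reshape(len(shapes), -1)
    mean = flat.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(flat - mean, full_matrices=False)
    variances = singular_values ** 2 / (len(shapes) - 1)
    return mean, vt[:n_modes], variances[:n_modes]

def fit_ssm(shape, mean, modes, variances, limit=3.0):
    """Project a new shape onto the model, clipping each mode to +/- limit standard deviations."""
    weights = modes @ (shape.ravel() - mean)
    bound = limit * np.sqrt(variances)
    weights = np.clip(weights, -bound, bound)
    return (mean + weights @ modes).reshape(shape.shape)

# Stage 1: synthetic templates (hypothetical): 16-landmark outlines whose width varies.
theta = np.linspace(0.0, 2.0 * np.pi, 16, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
templates = np.stack([circle * np.array([a, 1.0]) for a in np.linspace(0.8, 1.2, 9)])

# Stage 2: build the model from the templates.
mean, modes, variances = build_ssm(templates, n_modes=1)

# Stage 3: adapt the model to new shapes.
plausible = circle * np.array([1.1, 1.0])
extreme = circle * np.array([3.0, 1.0])
fitted_plausible = fit_ssm(plausible, mean, modes, variances)
fitted_extreme = fit_ssm(extreme, mean, modes, variances)
```

A shape within the range seen in the templates is reproduced exactly, whereas an implausibly wide shape is pulled back towards the learned variation. This illustrates both the strength of the prior and the limitation noted above: variation absent from the templates cannot be represented.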
Statistical appearance modelling (SAM) works on a similar principle to SSM, but additionally incorporates the appearance of a shape. This includes the colour and 'texture' (e.g. represented by voxel intensity) of the target object.
Atlas-based segmentation is an SBS technique that can segment images without the need for well-defined delineations between regions and pixel intensities. The approach utilises reference images and corresponding segmentation templates (atlases) to form transformation matrices, enabling reference images to be registered with the new image itself. The atlas provides an approximate location of the object in an image; this information allows the model to localise the object within the new image and to further distinguish between the object of interest and its surroundings.
Active contour modelling, another SBS technique, differs from the previously mentioned models in that it does not necessarily require such training templates. In this case the algorithm utilises the contour present within the image to form a delineation. This form of modelling can be seen in use within photo editing software, for example the Lasso tool in Photoshop. Through an iterative process, a user places several marks around a target object present in an image, which the model then connects based on the contour around the shape. This does, however, mean that in most cases the initial contour must be provided manually by a user before the final contour can be found automatically.

Deep learning semantic segmentation approaches
The previous two subsections primarily went over earlier segmentation algorithms, approaches that are still currently in use. As touched upon earlier, over the past several years deep learning-based approaches have paved the way for a new generation of image segmentation models with outstanding performance improvements (Minaee et al 2020).
A deep learning segmentation pipeline typically consists of dataset preparation, network architecture selection, training, validation, and inference. First, a labelled dataset is created, comprising input images and corresponding ground truth annotations that define the desired segmentation. A suitable deep neural network architecture, such as U-Net (Ronneberger et al 2015) or Mask R-CNN (Kaiming et al 2017), is then chosen or designed specifically for the segmentation task. The network is trained using the labelled dataset, with its parameters optimised iteratively to minimise a chosen loss function, such as pixel-wise cross-entropy or a Dice-based loss, which quantifies the dissimilarity between predicted segmentations and the ground truth.
Validation is performed using a separate dataset to assess the network's performance and guide any necessary fine-tuning. Once trained, the network is ready for inference, where it takes unseen input images and produces segmentation predictions by applying the learned patterns and features.
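The loss functions and overlap measure named above can be written out for the binary (single-class) case as a short sketch. The function names and the small example masks are hypothetical; multi-class segmentation would typically apply the same ideas per class.

```python
import numpy as np

def dice_coefficient(pred_mask, true_mask, eps=1e-7):
    """Overlap between two binary masks: 1.0 is a perfect match, 0.0 is no overlap."""
    pred_mask = pred_mask.astype(bool)
    true_mask = true_mask.astype(bool)
    intersection = np.logical_and(pred_mask, true_mask).sum()
    return float((2.0 * intersection + eps) / (pred_mask.sum() + true_mask.sum() + eps))

def soft_dice_loss(prob, true_mask, eps=1e-7):
    """Differentiable Dice-based loss computed on predicted probabilities."""
    intersection = np.sum(prob * true_mask)
    return float(1.0 - (2.0 * intersection + eps) / (np.sum(prob) + np.sum(true_mask) + eps))

def pixel_cross_entropy(prob, true_mask, eps=1e-7):
    """Mean pixel-wise binary cross-entropy."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(true_mask * np.log(prob) + (1.0 - true_mask) * np.log(1.0 - prob)))

# Ground truth: a 2x2 object. Prediction: the same object shifted one pixel to the right.
truth = np.zeros((4, 4))
truth[1:3, 1:3] = 1.0
shifted = np.zeros((4, 4))
shifted[1:3, 2:4] = 1.0

dice_shifted = dice_coefficient(shifted, truth)   # 2 overlapping pixels out of 4 + 4
loss_perfect = soft_dice_loss(truth, truth)
ce_uncertain = pixel_cross_entropy(np.full((4, 4), 0.5), truth)
```

In this example half of the predicted object overlaps the ground truth, giving a Dice coefficient of 0.5, while a perfect prediction gives a Dice loss of zero.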
Deep learning-based segmentation has demonstrated remarkable capabilities in various fields, including medical imaging, object detection, and semantic segmentation. Its ability to automatically learn relevant features from large datasets has significantly advanced the accuracy and efficiency of segmentation tasks, leading to important applications in both computer vision research and real-world settings.
Here, we focus on a few different types of network architecture used specifically for medical image segmentation. Without focusing too much on other factors such as the type of learning or loss functions, the following section provides a brief overview of a few of these networks, listing both 2D and 3D architectures.
Originally proposed in 2015, U-net is a convolutional neural network (CNN) developed for 2D biomedical image segmentation (Ronneberger et al 2015). The CNN has a modified architecture, adopting a symmetrical structure and skip connections aimed at allowing for optimal model training on medical image datasets. The network's popularity can be attributed to its ability to learn segmentation in an end-to-end setting, to precisely localise and distinguish borders, and to work well with very few annotated images. U-net has become the standard for most medical image segmentation tasks and the backbone from which several other popular architectures are structured (Wang et al 2022).
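The role of a skip connection in such a symmetrical architecture can be illustrated by following the data flow of a single encoder-decoder level. This sketch deliberately omits the learned convolutions of a real U-net and uses plain max pooling and nearest-neighbour upsampling, so it only shows how shapes change and how high-resolution encoder features are re-joined with upsampled features. The function names are hypothetical and even image dimensions are assumed.

```python
import numpy as np

def downsample(features):
    """2x2 max pooling on a (channels, height, width) array with even height and width."""
    c, h, w = features.shape
    return features.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample(features):
    """Nearest-neighbour upsampling by a factor of two in both spatial dimensions."""
    return features.repeat(2, axis=1).repeat(2, axis=2)

def unet_like_level(features):
    """One encoder-decoder level: the skip path keeps full resolution, the other path is coarse."""
    skip = features                        # high-resolution detail from the encoder
    bottleneck = downsample(features)      # coarse, wider-context representation
    decoded = upsample(bottleneck)         # restored to the input resolution
    return np.concatenate([skip, decoded], axis=0)  # channel-wise concatenation

x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
y = unet_like_level(x)   # shape (4, 4, 4): 2 skip channels + 2 decoded channels
```

The concatenated output carries both the original fine detail and the coarse context, which is the property that helps the network localise borders precisely.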
Most recently, a complete segmentation of the vocal tract was carried out to delineate four different articulators and the vocal tract (Ruthven et al 2021). A dataset of five participants was used, each subject counting from one to ten in British English whilst in an rtMRI machine. Between the five participants, there were a total of 392 MR images (or frames), which were segmented by a radiologist. The paper successfully presented an automatic method to fully segment multiple groups of articulators and the vocal tract using a U-net-like framework, and additionally provided a novel, clinically relevant metric for assessing the accuracy of vocal tract and articulator segmentations. Although generalisability was noted to be good, the work stated that the model performed less favourably in preserving airway gaps between articulators, especially in the case of soft palate closures in instances where the ground truth data suggested the space was open. Larger classes yielded better Dice coefficients and general Hausdorff distances than smaller ones, as could be expected. Future work requires addressing these factors and potentially using a larger range of vocal tract configurations. It is additionally worth noting that the model's applicability to recordings taken from other MRI machines is unclear, and it is likely that it will not perform well on images not taken with the same MRI machine as used in the paper.
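For readers unfamiliar with the Hausdorff distance, the sketch below computes the classic symmetric form between two binary masks, measured in pixels over all foreground pixels. This is an assumption made for illustration and may differ from the exact variant reported by Ruthven et al (2021); the function names and example masks are hypothetical.

```python
import numpy as np

def mask_to_points(mask):
    """Coordinates (row, column) of all foreground pixels in a binary mask."""
    return np.argwhere(mask).astype(float)

def hausdorff_distance(points_a, points_b):
    """Largest distance from any point in one set to its nearest point in the other set."""
    diffs = points_a[:, None, :] - points_b[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    a_to_b = dists.min(axis=1).max()
    b_to_a = dists.min(axis=0).max()
    return float(max(a_to_b, b_to_a))

# Ground truth: a 2x2 object. Prediction: the same object shifted one pixel to the right.
truth = np.zeros((4, 4), dtype=bool)
truth[1:3, 1:3] = True
shifted = np.zeros((4, 4), dtype=bool)
shifted[1:3, 2:4] = True

distance = hausdorff_distance(mask_to_points(truth), mask_to_points(shifted))
```

Unlike the Dice coefficient, which measures overall overlap, the Hausdorff distance reports the worst-case boundary disagreement, so a single stray region far from the target can dominate it. Small classes are therefore sensitive to both measures in different ways.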
Medical image data, whether CT or MRI, are often acquired in 3D. To take advantage of these high-dimensional datasets, Çiçek et al (2016) extended the U-net architecture to 3D data, proposing an architecture capable of performing volumetric segmentation directly, named 3D U-net. However, due to the computational limitations posed when using such a dataset, the number of down-sampling steps had to be reduced, resulting in a model with reduced segmentation accuracy.
V-Net worked around this problem by employing residual connections to create a deeper network with more down-sampling steps. Although the network performance did improve, this 3D segmentation network, and others developed after it, face an underlying issue surrounding the need for high computational power and GPU memory, which are often not available during the training process.

Conclusion
To conclude, we have explored the various oral actions the vocal tract goes through and how these pertain to the articulatory function. A detailed summation of the methods of articulatory recording has been provided, addressing their advantages and limitations, as well as other metrics relevant to their use. Cross-domain image-to-image translation seems viable and constraints surrounding datasets can be worked around, whilst the use of image segmentation shows promising applications for processing downstream tasks. The overarching problems associated with these systems have been addressed in the discussion section, with the subsequent sub-sections providing the foundations for spearheading solutions to problems related to speech analysis and speech correction, mastication and, more broadly, oral processing. Throughout the paper, particularly in reference to the 12 modelling techniques, we have seen the potential clinical significance of a technique capable of modelling the complete mouth. In its simplest form, a viable way to model the complete mouth will see downstream applications in speech correction and designing foods for the ageing population. In the dental field we would be able to gain information about a patient's oral actions that would become part of creating a personalised dental treatment plan. In its initial state, image-to-image translation holds the potential to facilitate seamless transitions between diverse MRI weightings. Given the variability of MRI machines in hospitals, leveraging this technology could prove instrumental in enhancing the fidelity of vocal tract MRI frames.
David Bradshaw and Maria-Teresa Addison (Haleon PLC) are gratefully acknowledged for their insights and help with conceptualising this research.
Figure 2 shows the MRI view for the external view frame of figure 1.
Variables: M = input from internal view (rtMRI); I = input from external view (camera); Y = oral actions; D = interrelationship between M and I; D′ = interrelationship between M and I over time. For any given frame pair, we can state D to be the relationship between M and I, represented as: $D = \{M, I\}$. We can additionally observe the relationship between M and I in the context of a whole oral action, represented through multiple consecutive frames (a video). A whole sentence in the form of a video would include temporal information present over multiple frames. With Y as oral actions and D′ as the relationship between M and I over time, D′ can be represented as: $D' = \{M, I, Y\}$.

Figure 1. Image illustrating the coronal (left) and sagittal (right) views of the face. Reproduced from Scholes and Skipper (2020) under CC BY license.

Figure 2. Image illustrating an MRI sagittal view, showing all the articulators forming the vocal tract. Reproduced from Scholes and Skipper (2020) under CC BY license.

Figure 3. A still frame of a 2D rtMRI recording (left) alongside a manually annotated segmented form of the image, illustrating each articulator as an instance (right). Reproduced from Ruthven et al (2021) under CC BY 4.0 licence.

Table 1. A table providing information regarding the articulatory structure, including sample consonants and voiced/voiceless fricatives.
The consonant [d], as in 'Dog'. The consonant [k], as in 'King'.
Glottis/Larynx (active). Glottal: sounds made using the glottis as primary articulation. The consonant [ɦ], as in 'beHind'. The consonant [h], as in 'High'.
Tongue (mobile). Retroflex: sound made when the tongue has a flat, concave or curled shape; articulated between the alveolar ridge and hard palate. The consonant [ɺ], as in 'Rest'. The consonant [ʂ], as in the Swedish word 'foRS' (meaning 'rapids').

Table 2. The capabilities of each articulatory recording system, with MRI images highlighting the articulators each method captures. Images sourced from Lim et al (2021).

Table 3. A table covering nine different factors surrounding the use of electropalatography in capturing oral movements.

Table 4. A table covering nine different factors surrounding the use of ultrasound in capturing oral movements.
The technique can experience double edges, reflections, and generally poor-quality images (Stone 2005).
Examples of use: Bennett et al (2017) present an ultrasound analysis of the secondary palatalisation contrast in Irish, analysing data from 5 different Irish speakers.
Overall safety: 'Ultrasound, however, is becoming cheaper, is safe, is easy to set up and use, and is able to provide real-time images of the whole tongue during speech.' (Wilson 2014).
Sampling rate: Ultrasounds with frequencies up to 10 MHz are usually used in medical practice (Reda et al 2021).
... et al (2022), titled 'Tongue Contour Tracking and Segmentation in Lingual Ultrasound for Speech Recognition: A Review'.
Examples of available datasets: UltraSuite: A Repository of Ultrasound and Acoustic Data from Child Speech Therapy Sessions (Eshky et al 2018). Comparing articulatory images: An MRI / Ultrasound Tongue Image database (Cleland et al 2011).

Table 5. A table covering nine different factors surrounding the use of electromagnetic articulography in capturing oral movements.
Advantages: Data collected within the oral cavity has high spatial accuracy and temporal resolution, thus producing fairly accurate information on articulatory gestures. EMA allows for the measuring of multiple articulators at once (Rebernik et al 2021).
Disadvantages: Sensor positioning is limited to the anterior oral tract, with velum tracking not possible without causing significant discomfort to subjects (Rebernik et al 2021). Sensors cannot be placed too close to each other without disturbing measurement accuracy. The method does not allow for high-quality simultaneous recording of auditory data since sensors attached to the tongue change acoustics (Meenakshi et al 2014); however, it does afford some speech production with 'moderate interference (Hasegawa-Johnson 1998)' (Dromey et al 2018).
Examples of use: Hoke et al (2019) used EMA to investigate the effects denture adhesives have in minimising denture displacement while chewing. They successfully used EMA to demonstrate that the use of denture adhesives statistically reduces the likelihood of denture micro-movements.
Rebernik et al (2021), titled 'A review of data collection practices using electromagnetic articulography'.
Examples of available datasets: The Electromagnetic Articulography Mandarin Accented English (EMA-MAE) corpus of acoustic and 3D articulatory kinematic data (Ji et al 2014). Real-time magnetic resonance imaging and electromagnetic articulography database for speech production research (TC) (Narayanan et al 2014).
EMA is one of the few methods that illustrate oral gestures continuously, as opposed to technologies such as EPG that only illustrate the motion of the tongue when in contact with the palate. This makes it possible to record multiple articulators simultaneously and thus observe inter-articulatory behaviour (Rebernik et al 2021). The sensor pads placed in the mouth are small, taking only around 10 minutes for an adult to become adapted (Dromey et al 2018).

Table 6. A table covering nine different factors surrounding the use of static palatography in capturing oral movements.

Table 7. A table covering nine different factors surrounding the use of video and optical tracking in capturing oral movements.
Endoscopy and photoglottography
This section defines two techniques that both pertain to the study of glottal activity; the techniques can quantify the movements of the tongue, velum and larynx. As detailed in table 10, endoscopy works by inserting a laryngoscope down a patient's throat to observe glottal activity. The laryngoscope is a thin tube with a video camera and light attached to it. Since the presence of the endoscope hinders one's ability to speak, the technique is limited to the study of certain sounds, such as prolonged vowels. Photoglottography (PGG) is a system developed by Sonesson (Chi et al 2021), also used in studying glottal behaviours. This method is rather non-invasive

Table 8. A table covering nine different factors surrounding the use of x-ray microbeam in capturing oral movements.

Table 9. A table covering nine different factors surrounding the use of electroglottography in capturing oral movements.

Table 10. A table covering nine different factors surrounding the use of endoscopy and photoglottography in capturing oral movements.

Table 11. A table covering nine different factors surrounding the use of oral airflow/pressure in capturing oral movements.

Table 12. A table covering nine different factors surrounding the use of nasal airflow in capturing oral movements.

Table 13. A table covering nine different factors surrounding the use of real-time magnetic resonance imaging in capturing oral movements.

Table 14. A table covering nine different factors surrounding the use of x-rays in capturing oral movements.
... MRI owing to the distinctive multi-articulator capabilities it offers. Currently, four publicly accessible datasets have been released (Douros et al 2019, Scholes and Skipper 2020, Lim et al 2021, Ruthven et al 2021). Each dataset is accompanied by transparent protocols detailing the procedures during data collection, specifying the MRI machine utilised and the specific coil configurations employed. Moreover, datasets are available upon request (Birkholz et al 2020, Dediu et al 2022, Isaieva ...
(Munhall et al 1995): Consists of 25 films (totalling 55 min) of x-ray footage converted from film collected in the 1970s. The data set contains a total of 14 Canadian English and French speakers.