An evaluation of approaches for enhancing inductive learning with a transductive view

The availability of sufficient labeled data is a challenge for most inductive learners, which must generalize from a limited labeled dataset. Traditional semi-supervised approaches to this problem use methods such as wrapping multiple inductive learners around derived pseudo-labels, unsupervised feature extraction, or suitable modification of the objective function. In this work, a simple approach is adopted whereby an inductive learner is enhanced by suitably enabling it with a transductive view of the data. The experiments, though conducted on a small dataset, provide a few insights: a transductive view benefits an inductive learner; a transductive view that considers both attributes and relations is more effective than one that considers attributes or relations alone; and graph convolution based embedding algorithms capture the information in transductive views more effectively than popular knowledge embedding approaches.


Introduction
Semi-supervised learning (SSL) uses both labeled and unlabeled data to perform a learning task. In contrast, traditional supervised learning builds a model from a labeled dataset in order to generalize to unlabeled data. For SSL to work, certain assumptions must hold across the labeled and unlabeled data. These assumptions [1] are primarily around smoothness, clusters or manifolds. Existing SSL algorithms can be divided into two major categories, i.e. inductive and transductive. In inductive SSL, a traditional inductive learning algorithm is extended via pre-processing, the objective function or pseudo-labeling. A wrapper method such as co-training [2] is an inductive SSL approach that uses a set of learners (possibly supervised) with a pseudo-labeling technique. Similarly, an unsupervised pre-processing approach may use clustering on the labeled and unlabeled data to find useful features for a supervised learner. An approach like the semi-supervised support vector machine (S3VM) essentially constrains the maximum margin hyperplane using both labeled and unlabeled data. From this perspective, inductive semi-supervised approaches do not directly combine an inductive and a transductive learner.
On the other hand, transductive SSL requires prior availability of all training and test data and has to be re-trained for unseen samples. Transductive learning was first defined by Vapnik et al. [3] in 1998. Later, it was established empirically [4] that transductive learning can be a more efficient learning method than inductive learning. For transductive SSL, information has to be propagated through connections between the data points, and this typically happens with a graph based model. The label propagation algorithm uses the smoothness assumption, which requires samples with similar labels to be close neighbours in the sample space. A learning algorithm such as the graph convolution network allows label influence to propagate through graph convolution.
A transductive SSL approach such as graph convolution offers some advantages. Firstly, since the learner is not confined to only the labeled data, it can better handle outliers. Secondly, due to the convolution operation, even the unlabeled data contributes to the representations being learnt. Finally, transductive learning has the advantage of capturing more domain specific information as it can look at the full labeled and unlabeled dataset.
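To make the smoothness assumption concrete, the following is a minimal sketch of iterative label propagation on a toy graph (illustrative only, not the exact algorithm used in this work): labels spread from labeled nodes to unlabeled neighbours until the scores stabilize.

```python
import numpy as np

def label_propagation(adj, labels, n_classes, n_iter=50):
    """adj: (n, n) adjacency matrix; labels: array with -1 for unlabeled nodes."""
    n = adj.shape[0]
    # Row-normalized transition matrix: each node averages its neighbours.
    deg = adj.sum(axis=1, keepdims=True)
    trans = adj / np.clip(deg, 1e-12, None)
    # One-hot scores for labeled nodes, zeros for unlabeled ones.
    scores = np.zeros((n, n_classes))
    labeled = labels >= 0
    scores[labeled, labels[labeled]] = 1.0
    for _ in range(n_iter):
        scores = trans @ scores
        # Clamp the known labels after each propagation step.
        scores[labeled] = 0.0
        scores[labeled, labels[labeled]] = 1.0
    return scores.argmax(axis=1)

# Path graph 0-1-2-3: node 0 is labeled class 0, node 3 is labeled class 1.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
pred = label_propagation(adj, np.array([0, -1, -1, 1]), n_classes=2)
# Unlabeled node 1 ends up in class 0, node 2 in class 1 — each follows
# the nearer labeled node, illustrating the smoothness assumption.
```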
Transductive and inductive learning can possibly be combined for higher learning performance. This work is an attempt to evaluate the relative performance of different hybrid approaches where both types of learning are combined.

Existing work
Transductive learners get access to the larger dataset and more domain information, while inductive learners may be starved of such additional context. In this paper, the combined set of labeled and unlabeled data is referred to as the transductive learner's view, whereas the labeled data alone is referred to as the inductive learner's view. Attempts to combine both views can be divided into several categories.

Enhancing text embedding with knowledge
One possible approach to combining the transductive and inductive views of the dataset is to use an inductive learning methodology with pre-training on the transductive view, so that the inductive learner benefits from the transductive view. Several efforts take this approach by accessing domain knowledge from a knowledge base. Classification of clinical text is notoriously difficult, and Yao et al. [5] performed CNN based text classification with a model that jointly learns from word embeddings and knowledge based rules. Ifrim et al. [6] addressed the lack of labeled data by augmenting the corpora with knowledge from sources such as WordNet, along with "topic concept" and "concept word" spaces derived using a topic model. Annervaz K. M. et al. [7] used information derived from related fact triples in an external knowledge graph to enhance the performance of text classification tasks. The availability of additional information in knowledge bases has been utilized in other domains as well, such as knowledge based question answering [8], knowledge based summarization [9] and knowledge assisted recommendation [10,11]. Zhang et al. [12] performed zero-shot text classification using a dual strategy of text augmentation supplemented with semantic knowledge from various sources such as word embeddings, a general knowledge graph and a class hierarchy. Sinora et al. [13] followed a similar strategy of enhancing word embeddings with embeddings of related synsets (BabelNet [14]) and related Wikipedia concepts (NASARI [15]) to deliver knowledge-enhanced document embeddings. Wang et al. [16] addressed the challenge of short text classification by adding embeddings of explicit knowledge to the embeddings of implicit knowledge captured by text embedding. Xiao Ding et al. [17] performed an event driven stock prediction task using knowledge embeddings from external knowledge sources.
A cryptic event descriptor hardly has enough information to deliver a good prediction, but with access to external knowledge the task becomes feasible.
Researchers have developed ConceptNet [18], which essentially uses a combined multilingual knowledge graph to deliver commonsense-enhanced word embeddings for tasks reliant on word relatedness. The knowledge embedding provided by ConceptNet is somewhat different from classical deterministic knowledge graph based embedding. A common sense knowledge-enhanced word embedding such as this can deliver additional semantic context based on the principles of distributional semantics.

Text and knowledge mutual attention
In most of the works mentioned above, the joint learning strategies are based on concatenation of explicit and implicit information. However, a given text may be related to several concepts, and each concept may not be equally important for the inductive learning task. Similarly, for a concept relevant to the task, each part of the text may not be equally relevant. Blind concatenation of text and candidate concepts does not address this aspect. Hence, some researchers have attempted to address it using variants of attention schemes. Chen et al. [19] implemented two mutual attention schemes, i.e. text to concept and concept to set of concepts. These two attention weights were combined to get the final attention weight of each concept, which was then used to compute a weighted sum of the concept vectors to be concatenated with the embedding of the text. A similar attention based strategy is employed in several works [20,21], where mutual attention between knowledge and text is used either for a knowledge graph completion task or for a suitable learning task on the text side such as text classification or relationship extraction. However, the success of such an approach depends on the richness of content on both the text side (the inductive learner's view) and the knowledge side (the transductive learner's view).

Alignment of latent spaces of text and knowledge
Comparatively, less research attempts to address the difficult problem of aligning the two separate latent spaces of knowledge and text. There are several approaches, i.e. enhancing text embedding with knowledge [22,23], enhancing knowledge embedding with text [24,25] and learning them together by jointly training the model on both text and knowledge [26,27,28,29]. However, all of these studies were conducted with very sizable datasets where both the labeled data on the inductive learner side and the domain information on the transductive learner side were plentiful. The dataset chosen for this work is rather small, and the labeled data on the inductive learner side is too cryptic to adopt such a strategy.

Knowledge embedding derived from Knowledge graph
For any such joint learning initiative, unless a knowledge base or a knowledge graph is already available, the necessary next step is to create a knowledge graph and extract an embedding that captures the knowledge context of the entity in question. A knowledge graph is an aggregation of many triples, each of which describes a relation between two entities. Existing knowledge graph embedding algorithms essentially provide embeddings for the entities as well as the relations in the graph. The joint learning task is then to extract the relevant embeddings of entities and relations from the knowledge graph and use them in the desired manner in the learning tasks. There has been significant research around knowledge graph embedding, and several embedding strategies exist for obtaining low dimensional representations of entities as well as relations. The most popular approach uses a real valued point-wise space that is describable via tensors. TransE [30] assumes that entities and relations lie in the same representation space and follow a translational relationship. The training loss essentially constrains related entities to be nearby and unrelated entities to be farther apart in this space. There are other variants of TransE that relax these assumptions, for example allowing entities and relations to lie in different representation spaces. Instead of a real valued space, ComplEx [31] assumes a complex latent space where each entity and relation has a real and an imaginary component. The other important difference is that ComplEx uses latent semantic similarity instead of translational distance. There are further embedding methods that use Gaussian distributions, manifolds etc. Apart from these structural modeling approaches, there are attempts such as DKRL [32] to build knowledge representations using auxiliary textual information.
These knowledge graph embedding algorithms can be used to derive embeddings of entities and relations in any knowledge graph that has already been constructed.
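The translational idea behind TransE can be sketched in a few lines: for a true triple (h, r, t) the relation acts as a translation, so h + r should land close to t, and a margin loss pushes true triples closer than corrupted ones. The toy vectors below are illustrative, not trained embeddings.

```python
import numpy as np

def transe_score(h, r, t):
    # Lower is better: distance between the translated head and the tail.
    return np.linalg.norm(h + r - t)

def margin_loss(pos_score, neg_score, margin=1.0):
    # Require true triples to score at least `margin` better than corrupted ones.
    return max(0.0, margin + pos_score - neg_score)

h = np.array([1.0, 0.0])
r = np.array([0.0, 1.0])
t_true = np.array([1.0, 1.0])    # h + r == t_true, so the score is 0
t_false = np.array([-1.0, 0.0])  # a corrupted tail

pos = transe_score(h, r, t_true)     # 0.0
neg = transe_score(h, r, t_false)    # ||(2, 1)|| = sqrt(5)
loss = margin_loss(pos, neg)         # 0.0: the margin is already satisfied
```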

Graph convolution as an approach for knowledge embedding
The algorithms in the area of graph representation learning can deliver embeddings for nodes, edges and whole graphs in a low dimensional vector format that is amenable to standard machine learning techniques. There are several categories of graph representation learning techniques: (i) random walk and skip-gram model based approaches such as DeepWalk [33] and Node2Vec [34], which essentially capture graph structure information from different orders of neighbourhood; (ii) matrix factorization based approaches such as HOPE [35]; (iii) auto-encoder based approaches such as SDNE [36]; (iv) spectral and spatial convolutional approaches using graph neural networks, such as the graph convolution network [37] and GraphSAGE [38].
Graph based learning algorithms such as the graph convolution network (GCN) have emerged as a feasible transductive learning option in the last few years. The obvious advantage lies in the way unlabeled data can influence the label prediction task through message passing. GCNs are now being used in the text domain for learning tasks such as text classification [39], sentiment and emotion prediction [40,41] and crisis prediction from tweets [42], specifically for applications with insufficient labeled data. The GraphSAGE [38] algorithm is very similar to GCN except for the critical step of neighbourhood sampling, which makes it more scalable than GCN. Both GCN and GraphSAGE can be used inductively or transductively; in this work, they are used transductively to obtain entity embeddings in a knowledge graph. The knowledge embeddings provided by TransE and its variants are essentially structural embeddings, and these knowledge graph embedding algorithms do not use the message passing approach that graph convolution uses. In that sense, graph convolution offers another way to deliver knowledge embeddings from a constructed knowledge graph. Very recently, in 2021, Nasrullah Sheikh et al. [43] used a relation-aware graph attention model to provide knowledge embeddings with superlative performance.
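The message passing that distinguishes GraphSAGE from structural embeddings like TransE can be illustrated with one mean-aggregation layer (a sketch with random stand-in weights, not the trained model used in this work): each node samples a bounded number of neighbours, averages their features, concatenates the result with its own features, and applies a learned linear map.

```python
import numpy as np

rng = np.random.default_rng(0)

def sage_layer(features, neighbours, W, n_samples=2):
    """One GraphSAGE mean-aggregation step.
    features: (n, d) node attributes; neighbours: adjacency lists;
    W: (2d, d_out) weight matrix applied to concat(self, aggregate)."""
    out = []
    for v, neigh in enumerate(neighbours):
        if len(neigh) > n_samples:
            # Neighbourhood sampling keeps the per-node cost bounded.
            neigh = rng.choice(neigh, size=n_samples, replace=False)
        agg = features[list(neigh)].mean(axis=0)
        h = np.concatenate([features[v], agg]) @ W
        out.append(np.maximum(h, 0.0))  # ReLU non-linearity
    h_new = np.stack(out)
    # L2-normalize each node representation, as in the original algorithm.
    norms = np.linalg.norm(h_new, axis=1, keepdims=True)
    return h_new / np.clip(norms, 1e-12, None)

features = rng.normal(size=(4, 3))            # 4 nodes with 3-dim attributes
neighbours = [[1, 2], [0, 2, 3], [0, 1], [1]]
W = rng.normal(size=(6, 5))                   # concat(3 + 3) -> 5 dims
emb = sage_layer(features, neighbours, W)
```

Because the aggregation uses both node attributes and graph structure, a transductive run over the whole knowledge graph yields entity embeddings that reflect labeled and unlabeled entities alike.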
In this work, though such a graph attention model is not used, a GraphSAGE based knowledge embedding is attempted. TransE, ComplEx, ConceptNet and GraphSAGE are used as different ways of deriving the transductive learner's view.

Dataset requirement
Typically, in an inductive learning setting, entities are classified based on labeled data, and the labeled dataset contains only limited information such as entity labels or tags. This is a typical setting in a recommendation system. On the other hand, a knowledge graph capturing entity relations in a related domain is an example of a transductive view, which a transductive learner can use when asked to learn the classes of these entities. An inductive learner, by contrast, may try to predict classes purely based on entity labels. This scenario facilitates a hybrid transductive-inductive learning opportunity. A suitable dataset for this learning problem is one that offers relatively cryptic information as entity labels.

Chosen dataset
In this work, a recently published knowledge-guided recommendation dataset, MindReader [44], is chosen as it fits the dataset requirement outlined above. Typically a recommendation system is built on the basis of explicit ratings and implicit feedback (purchase history, browsing history etc.); MindReader additionally provides ratings of non-recommendable items (such as actors) sourced from its knowledge graph (KG), which can be used to improve recommendation quality.
In this work, the MindReader dataset has not been used to set up a recommendation problem. Instead, it has been used to set up a sentiment classification problem (like, dislike) for movies, for different standalone as well as hybrid inductive and transductive learning approaches. The dataset, as shown in Figure 1, is provided in the following form: (i) ratings (sentiment labels) of recommendable items described by a unique id, a URL to Wikidata and an item flag; (ii) items associated with KG triples (head URL, relation, tail URL); (iii) associated KG entities (URL, name, labels). Figure 2 shows a dataset sample extracted from MindReader. The labels provided for KG entities are cryptic, whereas more information can be accessed in the domain KG; the URLs provided can be used to access the Wikidata details. The dataset provides sentiments from multiple users, so to reduce the variability of sentiment across users, a set of 574 items (including both recommendable and non-recommendable) rated by a single user is chosen. This is representative of scenarios encountered in other domains such as healthcare, where domain related historical knowledge is vital for successful diagnosis.
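The three-part layout described above can be sketched with toy in-memory records. All identifiers, field orders and names below are illustrative assumptions, not the dataset's exact schema; the point is how the inductive view (cryptic label plus sentiment) and the transductive view (KG edges over labeled and unlabeled entities) are carved out of the same files.

```python
# Toy stand-ins for the three parts of the dataset (hypothetical values).
ratings = [  # (entity uri, sentiment, recommendable flag)
    ("Q_movie", 1, True),
    ("Q_actor", -1, False),
]
triples = [  # (head uri, relation, tail uri)
    ("Q_movie", "HAS_GENRE", "Q_genre"),
    ("Q_movie", "STARRING", "Q_actor"),
]
entities = {  # uri -> (name label, KG classes)
    "Q_movie": ("Some Movie", ["Movie"]),
    "Q_actor": ("Some Actor", ["Actor", "Person"]),
    "Q_genre": ("action film", ["Genre"]),
}

# Inductive view: only the (cryptic) entity label and its sentiment class.
inductive_view = [(entities[uri][0], sentiment) for uri, sentiment, _ in ratings]

# Transductive view: the KG edges, covering unrated entities (Q_genre) too.
edges = [(head, tail) for head, _, tail in triples]
```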

Methodology
The learning methodology used is inspired by co-training, which uses two supervised learners with complementary views of the same data. In co-training, the most confident pseudo-labels of each learner are added in each iteration to the labeled dataset pool used by the other learner. The methodology used here can be thought of as single iteration co-training where the participant learners can employ inductive and transductive approaches. In this work, various standalone views are investigated to compare their relative performance with hybrid approaches: (i) Inductive view: the learner can only learn from the labeled dataset, which consists of the entity (cryptic entity labels as descriptors of the entity) and the entity class (like, dislike). (ii) Transductive view: the learner can additionally learn from the entity's perspective in the related domain. This can be arrived at in two ways.
• Transductive view based on a knowledge graph: a knowledge graph (KG) containing both labeled and unlabeled entities can be used to obtain a suitable KG embedding encapsulating the transductive view.
Apart from the standalone approaches mentioned above, learning from the inductive and transductive views is also combined. Two main approaches are used for combining the views: (i) Concatenation of views: the embeddings derived from each view are simply concatenated.
(ii) Cross attention approach: attention is used to obtain a knowledge-attended text embedding. Let $s_t$ be the embedding of the $t$-th word of the text description of an entity, where the length of the description is $|S|$. The entity representation vector is $\sum_{t=1}^{|S|} \beta_t s_t$, where $\beta_t$ measures the importance of the $t$-th word, as all words are not equally important. The relative importance of the word $s_t$ is determined by the knowledge guidance. The text embedding is $s_t \in \mathbb{R}^{k_c}$ and the corresponding KG embedding of the entity is $g \in \mathbb{R}^{k_w}$. Here, knowledge guidance is applied through a trainable affinity matrix $W_s \in \mathbb{R}^{k_w \times k_c}$ and bias $b_c \in \mathbb{R}^{k_w}$.
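The cross attention step can be sketched as follows. The exact scoring form (a tanh affinity between the KG embedding and each mapped word vector) is an assumption of this sketch, and the weights are random stand-ins; the shapes follow the definitions above.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def knowledge_attended_embedding(S, g, W_s, b_c):
    """S: (|S|, k_c) word embeddings; g: (k_w,) KG embedding of the entity.
    Returns the weighted entity representation and the weights beta."""
    # Score each word by its affinity with the knowledge embedding g.
    scores = np.array([g @ np.tanh(W_s @ s_t + b_c) for s_t in S])
    beta = softmax(scores)          # beta_t: importance of the t-th word
    return beta @ S, beta           # sum_t beta_t * s_t

k_c, k_w, n_words = 4, 3, 5
S = rng.normal(size=(n_words, k_c))     # word embeddings of the description
g = rng.normal(size=k_w)                # KG embedding of the entity
W_s = rng.normal(size=(k_w, k_c))       # trainable affinity matrix (here random)
b_c = rng.normal(size=k_w)              # trainable bias (here random)
rep, beta = knowledge_attended_embedding(S, g, W_s, b_c)
```

The resulting `rep` is the knowledge-attended text embedding, which can then be concatenated with the KG embedding or fed directly to the classifier.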
In this work, two simplifying assumptions are made: the knowledge graph from which the dataset is extracted is assumed to be homogeneous (no heterogeneous graph is attempted), and all relations are assumed to be similar (no multi-relational graph is attempted).

Results and analysis
Table 1 lists the results of the various experiments performed on this dataset. The following insights can be derived from the results: (i) A standalone transductive view using structural information alone (Node2Vec) performs as poorly as a standalone inductive view (GloVe) based purely on labels. (ii) A standalone transductive view using a structure oriented knowledge embedding (TransE) puts up a performance comparable to that of Node2Vec, which is expected. (iii) A standalone inductive view augmented with common sense knowledge (ConceptNet) is marginally better than the GloVe embedding due to the added knowledge component. (iv) A transductive view consisting of domain knowledge has more impact than enhancing the text embedding with common sense knowledge using ConceptNet. (v) Whenever the transductive and inductive views are combined, learning performance exceeds that of the standalone approaches. (vi) A knowledge embedding algorithm that uses the transductive view with a semantic similarity based loss function is more helpful than one with a translational loss function. However, this aspect is domain specific: in the movie domain, semantic relatedness of entities is likely to have more impact, and this need not hold true for all domains. (vii) GraphSAGE based entity embedding, derived from a transductive view of the sample space (both attributes and structure), delivers the most competitive performance. Hence, a transductive view that considers both the attributes and structure of the entities is more helpful than one that considers either alone.

Conclusion and future work
The work documented in this paper investigates different alternatives to traditional semi-supervised learning by augmenting the inductive learner with a transductive view of the sample space in the absence of enough labeled data. Though the conducted experiments are limited in the scope and size of the dataset, they show that a transductive view of the sample space benefits an inductive learner when combined with the inductive view. They also provide an interesting insight: in the transductive view, both the attributes and relations of the related entities are useful, and a transductive approach based on a graph convolution style algorithm such as GraphSAGE can offer highly competitive performance that may even exceed other good hybrid learning strategies. As next steps, several things can be pursued. Firstly, the dataset used in this work is limited in size; to address that, the experiments will be redone on a similarly suitable but much larger dataset. Secondly, we plan to continue the work in a domain where the lack of labeled data and the need for a transductive view of the related domain are very prominent, namely healthcare. In healthcare, the labeled information obtained from a case history is hardly sufficient for a good diagnosis, and doctors have to resort to past experience that can be successfully modeled as a domain knowledge graph. Hence, the next phase of the research will use a suitable dataset from the healthcare domain. Finally, while the inductive view will remain limited in scope, we will improve the transductive view by removing the simplifying assumptions mentioned in the methodology, using a multi-relational heterogeneous knowledge graph.