A Cluster-based Approach for Finding Domain wise Experts in Community Question Answering System

Community Question Answering (CQA) systems are an emerging web-based information service. CQA enables web users to get precise answers to questions from experts in a specific domain. CQA is used in a wide range of areas such as biomedicine, information technology, and tourism. This paper focuses on finding experts in a community question answering system using an unsupervised machine learning technique. Our proposed system consists of three phases: i) clustering tags, ii) determining experts for the unanswered questions, and iii) finding experts for the given question. Through tag analysis, we identify similar tags in a particular domain and group them into a cluster. Through question analysis, we find the unanswered questions in each domain and identify the experts. For a given question, the system suggests an expert using a pattern matching technique. The results show that our system predicts the experts for the given question with good accuracy.


Introduction
Question answering is an information retrieval task that aims to provide direct answers to a query from a structured database of knowledge or information. Generally, two types of question answering (QA) systems exist on the web: i) closed-domain and ii) open-domain. A closed-domain QA system deals with questions within a specific domain, whereas an open-domain QA system deals with questions across a wide range of areas. Menaha et al. [11] developed an open-domain QA system that uses web snippets as a knowledge base. Maheshwari et al. [9] developed an open-domain, voice-based QA system that exploits NLP techniques and also uses web snippets as a knowledge source. In recent times, QA systems have been developed for many real-time applications such as medicine, e-learning, project management, and community question answering (CQA). In particular, CQA forums are gaining popularity online.
CQA websites let people ask their questions directly instead of searching for an answer in a search engine. Searching for a question in a search engine returns millions of web pages, whereas a CQA website gives a direct and relevant answer to the posted query. This saves the user the time otherwise spent exploring the results. Nowadays, CQA websites such as Yahoo Answers, Quora, StackOverflow, StackExchange, and Wikianswers are used by many users to search for and explore knowledge. For example, users can ask questions in the information technology domain, and experts post answers to those questions.
Usually, on a CQA website, a user asks a question and any number of experts post answers to it based on their knowledge of that domain. A user who reads an answer can vote on it in one of two ways: upvote or downvote. The most relevant answers are upvoted, whereas less relevant or irrelevant answers are downvoted. Based on the votes, the user can easily identify the most relevant answer.
On CQA websites, previously asked questions and their answers are available in archives, so that a user can find an answer sooner instead of waiting for other users to post one. These websites help the user find the best answer to a question and find the experts in that domain. On a few CQA websites, flags are used to identify whether a user is good or bad, but this may cause deviation. To find the best answer, the user needs considerable knowledge of the domain, and to find experts, the user needs to analyze the answerers' profiles.
This work mainly focuses on finding experts on a CQA website for the tag identified from a given question. Our main contributions are summarized as follows. i) As an initial step, a tag analysis process is performed on the CQA website to identify the technical areas in which users ask questions and post answers. For each such domain, similar tags are identified from the CQA website and grouped into a cluster. We form an n x m matrix to identify the experts for each tag, where n denotes the users and m denotes the tags. ii) In the question analysis process, we collect the total number of answered and unanswered questions for each cluster. iii) To suggest an expert for a given question, the question is pre-processed and its keywords are identified. Using a string matching algorithm, the extracted keywords are compared with the tags in the clusters, and the matching experts are recommended to the user.
This paper is organized as follows. Section 2 presents related work and the different approaches followed in CQA. Section 3 gives the problem definition and solution overview. Section 4 describes our proposed methodology. Section 5 presents the experiments and results of the system. Section 6 concludes the paper.

Related Work
In the past decade, several studies have been carried out on Community Question Answering sites. These include routing questions to experts, assessing question quality, distinguishing good users from bad users, finding high-quality answers in archives, identifying best answers, and discovering experts and high-quality posts. These works are presented in this section.
Baichuan Li et al. [4] presented a method to route questions to the experts of a domain based on their previous answering profiles. They developed a framework called question routing, which consists of four phases: performance profiling, expert estimation, availability estimation, and answerer ranking. Antoaneta Baltadzhieva et al. [3] presented an approach to assess question quality based on features such as tags, question title, and question body length. They used supervised latent Dirichlet allocation for classification.
Imrul Kayes et al. [8] classified users as good or bad based on the flags on the CQA website. Hapnes Toba et al. [7] offered an approach to find high-quality answers in the archives. Maria Soledad Pera et al. [10] presented a method to choose the top-ranked related answer by using QAR.
Wenhao Zheng et al. [14] provided a method to identify the best answer in a CQA forum by using heterogeneous sources. Dallia Elafy et al. [5] delivered a hybrid model to predict the best answers by using content and non-content features. Roy et al. [13] classified answers based on their recent occurrences by considering time and post.
Fatemah Riahi et al. [6] presented a method for finding experts using a segmented topic model. Zhenlei Yan et al. [16] provided two models, a tensor model and a topic model, to rank the optimal answerer based on the optimized AUC (Area Under the ROC Curve). Yuan Yao et al. [15] identified high-quality questions and answers by considering voting-based feedback. A survey of question answering systems with classifications is presented in [12]. Annamoradnejad et al. [2] presented a method for predicting features from QA websites using BERT.
From this survey, we observe that no existing work performs domain-wise analysis and expert finding for each domain on a CQA website. Moreover, finding experts for unanswered questions in CQA has not yet been explored. To address these limitations, we propose an unsupervised machine learning model that finds the experts for a particular domain and for the unanswered questions.

Problem Definition
This section presents the problem definition and solution overview.

Problem Definition
Definition 1: Identify the domains $d_1, \ldots, d_n$ in which the largest numbers of questions and answers are posted on the CQA website. Form the clusters $c_1, \ldots, c_n$ by grouping the relevant tags in each domain.

Definition 2: For each cluster $c_1, \ldots, c_n$, identify the experts through the expert analysis process.

Definition 3: Identify the expert for the given question by analyzing the cluster data.

Solution Overview
Initially, domains are identified from the CQA website. For each domain, the relevant tags are collected and grouped into a cluster. For each cluster, experts are identified by collecting details from the CQA website in the form of a matrix. Each cluster is implemented as a hash map, i.e. a set of key-value pairs in which the key is a tag and the value is the expert for that tag. For a given question, the expert is recommended by analyzing the cluster data using the bag-of-words model. The novelty of our approach is that it recommends an expert for the given query based on domain.

Methodology
The proposed system is implemented in two phases: i) clustering similar tags and ii) finding experts for the given question. The outline of our system implementation is illustrated in Fig. 1.

Clustering Similar Tags
Here, the clusters are formed in three steps: i) Domain Identification, ii) Tag Analysis, and iii) Expert Analysis. Each step is discussed in this section.

Domain Identification
The Stack Overflow CQA website is used for the experiments. The website is analyzed to find the technical areas in which users raise the most queries and answerers post the most answers. The set of identified technical areas are called domains. The identified domains are Java, Web, Python, .Net, SQL, Django, R, Angular, Apple, Excel, OS, ASP.Net, Entity Framework, C, JavaScript, Git, Qt, Database, Android, Ruby, and others.

Tag Analysis
A tag is a keyword or label that groups a question with other similar questions. For each domain, we have collected similar tags from Stack Overflow. These tags are grouped into a cluster named after the domain. A cluster may consist of many tags. For example, the cluster Java consists of tags

Expert Analysis
We have analyzed the user pages in Stack Overflow to identify the experts in each domain. Each page contains the user name, the number of answers that user has posted for a tag, and the user's score for that tag. Using this data, we have formed an n x m matrix, where n denotes the users and m denotes the tags used by those users. For every user, the score for each tag is recorded in the matrix. Using this matrix, we have identified the answerers for each tag and also the top scorer for that tag.
From this matrix, the experts for each tag in a cluster are found based on their score on the CQA site. An expert is a person who has the highest score for a particular tag in the domain. In a cluster, the tags are arranged as key-value pairs, where the key is the tag and the value is the expert for that tag. A single tag can have more than one expert, and a person may be an expert in more than one domain.
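The step from score matrix to per-tag experts can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `ExpertMatrix`, the method `topExperts`, and the sample users, tags, and scores are all hypothetical. It assumes rows are users and columns are tags, and it returns every user tied for the highest positive score, since a tag can have more than one expert.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: derives the experts for each tag from an n x m score matrix.
final class ExpertMatrix {

    // scores[i][j] is the score of users[i] for tags[j] (rows = users, columns = tags).
    static Map<String, List<String>> topExperts(String[] users, String[] tags, int[][] scores) {
        Map<String, List<String>> experts = new LinkedHashMap<>();
        for (int j = 0; j < tags.length; j++) {
            // First pass: find the highest score in this tag's column.
            int best = 0;
            for (int i = 0; i < users.length; i++) {
                best = Math.max(best, scores[i][j]);
            }
            // Second pass: keep every user who reaches that score (ties allowed).
            List<String> top = new ArrayList<>();
            for (int i = 0; i < users.length; i++) {
                if (best > 0 && scores[i][j] == best) {
                    top.add(users[i]);
                }
            }
            experts.put(tags[j], top);
        }
        return experts;
    }

    public static void main(String[] args) {
        // Illustrative data only; these scores are not taken from the paper.
        String[] users = {"A1", "A2", "A3"};
        String[] tags = {"java", "spring"};
        int[][] scores = {{120, 0}, {45, 300}, {120, 10}};
        System.out.println(topExperts(users, tags, scores));
    }
}
```

With the sample data, A1 and A3 tie on the first tag and are both returned, while A2 alone leads the second tag. A tag for which no user has a positive score maps to an empty list.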

Finding Experts
A large number of answerers and experts are available on Stack Overflow, but millions of queries are still waiting to be answered. For example, 267380 queries in the Java domain and 690892 queries in the Web domain are yet to be answered. To provide support, we have collected several unanswered questions in each domain from Stack Overflow.

Question Pre-processing
Question pre-processing is an important step in the analysis process. Pre-processing transforms raw data into an understandable format. Real-world data is often incomplete and inconsistent, which may cause errors, and pre-processing is a way of resolving such issues. If the question is not analyzed during pre-processing, the result may be improper. The quality of the question must be ensured before running the analysis, because a higher-quality question yields a higher-quality answer.
To pre-process a question, it is first tokenized, i.e. split into word tokens. Then stop words such as what, who, which, where, then, is, are, uses, that, these, those, at, in, on, by, how, whom, if, such, when, and this are removed, so that only the important words of the question remain. These keywords are then used for mapping the question to a domain in the clusters.
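The tokenization and stop-word removal described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the class `QuestionPreprocessor` and method `extractKeywords` are hypothetical names, the stop-word set contains only the words listed in the text, and lower-casing and the choice of delimiters are assumptions. Periods and hyphens are deliberately not treated as delimiters so that tags such as `.net` or `python-3.x` survive as single tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: tokenizes a question and drops stop words.
final class QuestionPreprocessor {

    // Only the stop words named in the text; a complete list would be longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "what", "who", "which", "where", "then", "is", "are", "uses", "that", "these",
        "those", "at", "in", "on", "by", "how", "whom", "if", "such", "when", "this"));

    static List<String> extractKeywords(String question) {
        List<String> keywords = new ArrayList<>();
        // Assumed delimiters: whitespace and common sentence punctuation.
        String[] tokens = question.toLowerCase(Locale.ROOT).split("[\\s?,;:!()]+");
        for (String token : tokens) {
            // Skip the empty token that split() yields for leading delimiters.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                keywords.add(token);
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        System.out.println(extractKeywords("What is git rebase?"));
    }
}
```

For the sample question, "what" and "is" are removed and the keywords `git` and `rebase` remain for matching against the cluster tags.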

Determining Experts
To find the expert, we use hash maps. A hash map is a collection of key-value pairs that maps keys to values. Each hash map has the same name as its cluster, and there are twenty-one hash maps in our system.
In a cluster, each tag is represented as a key-value pair, where the key is the tag and the value is the expert for that tag. A single cluster can contain more than one key-value pair. Once the keywords are obtained from question pre-processing, they are compared with each key of each cluster. If a keyword matches a key of the cluster, the expert for that tag is suggested as the result.
In Eqn. (1), the function f(e) is used for finding experts from each cluster, where k refers to a keyword, c refers to a cluster, and t refers to a tag in a cluster. Here, n is the number of keywords, m is the number of clusters, and n1 is the number of tags in each cluster.
In each cluster, the tags are examined and compared with the keywords of the given question. The function f(t) compares a keyword with a tag. In Eqn. (2), k refers to the keyword and t refers to the tag. If there is a match, f(t) returns 1; otherwise it returns 0. If the value of f(t) is 1, the value stored for that tag is the corresponding expert.
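A minimal sketch of this lookup, assuming one hash map per cluster held inside an outer map keyed by cluster name. The class `ExpertFinder` and its methods are hypothetical, the tag comparison is assumed to be case-insensitive, and each tag is assumed to store a single expert for simplicity. The tag and expert identifiers in the usage example follow the entries reported for Table 2, but the cluster names they are placed under are assumptions.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: one hash map per cluster, mapping tag -> expert.
final class ExpertFinder {

    // Outer key: cluster name. Inner map: tag (lower-cased) -> expert identifier.
    private final Map<String, Map<String, String>> clusters = new HashMap<>();

    void addTag(String cluster, String tag, String expert) {
        clusters.computeIfAbsent(cluster, c -> new HashMap<>())
                .put(tag.toLowerCase(Locale.ROOT), expert);
    }

    // Compares every keyword with the keys of every cluster and collects the matching experts.
    Set<String> findExperts(List<String> keywords) {
        Set<String> experts = new LinkedHashSet<>();
        for (String keyword : keywords) {
            String key = keyword.toLowerCase(Locale.ROOT);
            for (Map<String, String> cluster : clusters.values()) {
                String expert = cluster.get(key);
                if (expert != null) {
                    experts.add(expert);
                }
            }
        }
        return experts;
    }

    public static void main(String[] args) {
        ExpertFinder finder = new ExpertFinder();
        finder.addTag("Git", "git", "A43");   // cluster placement is an assumption
        finder.addTag("SQL", "sql", "A38");
        finder.addTag("Web", "php", "A5");
        System.out.println(finder.findExperts(List.of("git", "rebase")));
    }
}
```

Because each cluster is a hash map, checking one keyword against one cluster is a constant-time lookup on average, so the cost of a query grows with the number of keywords times the number of clusters rather than with the total number of tags.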
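One way to write the two functions so that they agree with the definitions above is sketched here. This is a reading of the description, not a reproduction of the original equations: the symbols $k_i$, $t_{j,l}$, and $E(\cdot)$ (the expert stored as the value for a tag) are notation assumed for this sketch, and collecting the matches into a set is likewise an assumption.

```latex
% Sketch of Eqn. (1): experts collected over all keywords, clusters, and tags.
% k_i = i-th keyword, t_{j,l} = l-th tag of cluster c_j, E(t) = expert stored for tag t (assumed notation).
f(e) = \bigl\{\, E(t_{j,l}) \;:\; k_i = t_{j,l},\ 1 \le i \le n,\ 1 \le j \le m,\ 1 \le l \le n_1 \,\bigr\}
\tag{1}

% Sketch of Eqn. (2): keyword-tag comparison.
f(t) =
\begin{cases}
1, & \text{if } k = t \\
0, & \text{otherwise}
\end{cases}
\tag{2}
```

Under this reading, the condition $k_i = t_{j,l}$ in (1) is exactly the case $f(t) = 1$ in (2).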

Experiments and Results
We have used the Stack Overflow CQA website for the experiments. The system is implemented in Java. Initially, domain analysis is performed to find the technical areas in which users ask the most questions and experts post the most answers. As a result, we have identified 21 clusters.

Data Collection
For each domain, we have gathered the relevant tags from Stack Overflow. For example, the search query python in Stack Overflow returns relevant tags such as Python, Python-2.7, Python 3.x, Numpy, List, Dictionary, Py.test, Pip, Pandas, and DataFrame. In this way, we have collected the relevant tags in each domain. Table 1 shows the relevant tags in each domain, where C1…C21 denote the identified clusters.

More than 200 user pages in Stack Overflow were explored. On each page, the user's score is extracted for each tag. From this data, we have formed an n x m matrix, where n refers to the users (answerers) and m refers to the tags. From this matrix, the list of answerers for each tag is identified using their scores, and the top answerers for each tag are identified in the same way. Table 2 shows the list of answerers and the top answerer for each tag. In Table 2, A1, A2, …, An refer to the answerers or experts. Fig. 2 shows the tags and the number of answerers for each tag.

Table 2. Answerers and top answerer for each tag

Tag    Answerers                                  Number of answerers    Top answerer
php    A5, A52, A55, A59, A61, A72, A79, A80      8                      A5
Sql    A38, A53, A61, A69, A79                    5                      A38
Git    A8, A18, A20, A33, A43, A74, A82           7                      A43

Table 3 shows the experts identified for the given questions by our proposed system. We present the results for 10 questions in Table 3. As a first step, the keyword is extracted from the given question through question pre-processing. For the identified keyword, the cluster is found by comparing the keyword with the tags in each cluster. If a tag matches, the expert is retrieved from the cluster.

In every cluster, there are several tags, and the corresponding expert for each tag is stored as a key-value pair. Fig. 3 is a pictorial representation of the number of identified experts and tags in each cluster.

Conclusion
This paper recommends an expert for a given question by using a cluster-based approach. On CQA websites, the user posts a question and gets precise and relevant answers. We proposed a cluster-based approach for finding experts on the CQA website Stack Overflow. Our proposed work involves question pre-processing, tag identification, cluster analysis, and expert identification. First, domain-wise clusters are created through the tag analysis process. Second, the keywords are extracted from the given question using data pre-processing techniques. The experts are identified by string matching between the extracted keywords and the tags in the cluster. A similarity score is computed between the keyword and the experts in a particular domain. Based on the score, the expert is identified from the domain. The person who has the highest score is taken to be the expert in that domain.