Name-based Approach to Identify Suspicious Return Statements

Identifier names account for 70 percent of source code in terms of characters. Meaningful identifier names convey valuable natural language information about the corresponding software entities, and thus they can be exploited to analyse a program and comprehend its semantics. However, existing bug detection approaches often ignore the natural language information and semantics encoded in identifier names. To this end, we propose a name-based approach to identify suspicious return statements. A recent study suggests that source code elements are often lexically similar to their semantically related elements. Based on this assumption, we design a sequence of heuristic rules to determine whether the return statement of a given method matches its signature. Evaluations on 100 open-source Java projects suggest that the proposed approach detects 16 out of 60 real return bugs. The precision and recall of the proposed approach are 31% and 27%, respectively.


Introduction
Programmers often encode useful natural language information in the identifier names of software entities. Such identifier names can be exploited to analyse the semantics of programs and to aid program comprehension. However, existing program analysis tools, especially bug detection tools, often ignore the natural language information encoded in identifier names.
There are two reasons why bug detection tools ignore identifier names. First, it is challenging to mine the information embedded in identifier names because doing so involves complicated natural language understanding. Second, it is difficult to decide whether a given code snippet is buggy based on textual identifier names alone.
To exploit the useful semantic information encoded in identifier names, this paper proposes a name-based approach to detect suspicious return statements in method declarations. The rationale of the approach is that a mismatch between the signature of a given method and its return statement may indicate that the return statement is suspicious. We formulate the problem of deciding whether the return statement of a given method is correct as the problem of judging whether the return statement matches the signature of the method.
To evaluate the proposed approach, we extract 449,656 methods with return statements from 100 open-source Java applications. By parsing the commit history of the source code and checking the commits that fix return statements, we find 60 incorrect return statements in the real source code. By comparing these incorrect return statements against the suspicious return statement warnings reported by the proposed approach, we find that the proposed approach detects 16 of them, with a precision of 31% and a recall of 27%.
In this paper, we make the following contributions:
First, we propose a name-based approach to detect suspicious return statements by designing fine-tuned heuristic rules. To the best of our knowledge, we are the first to apply a name-based and heuristic-based approach to identifying code defects.
Second, evaluations on 100 open-source Java applications suggest that the proposed approach detects buggy code with a recall of 27%.
Third, we implement the proposed approach as a publicly available bug detector, NBBugs, which is available as an open-source tool: https://github.com/d12126977/NBBugs.

Bug Detection
Murali et al. [12] train an RNN model to detect incorrect API usage. Choi et al. [17] train a memory network to detect buffer overruns by leveraging raw source code. Wang et al. [18] train a deep belief network to predict code defects by extracting semantic features from the abstract syntax tree of the source code. Aftandilian et al. [1] and Hovemeyer et al. [2] implement bug detection tools based on static program analysis. However, such approaches ignore the information conveyed in identifiers. Michael et al. [4] present a learning-based and name-based approach to automatically detect accidentally swapped function arguments, incorrect binary operators, and incorrect operands in binary operations.

Name-based Approaches
Identifiers account for 70% of source code in terms of characters. They convey important natural language information and the semantics of the source code. Studies have found that identifiers are a crucial source of information for program comprehension and maintenance. Researchers exploit the information embedded in identifiers to detect inconsistent method names [3] and mismatches between the name and the body of a software entity [8] [9] [10] [11], to mine API specifications [6] [7], and to recommend argument names [4] and method names [5].

Approach
In this section, we present a name-based approach to detect suspicious return statements in method declarations. The rationale of the proposed approach is that the return statement specifies the output of the method, and thus a significant mismatch between the return statement of a given method and the signature of the method may suggest that the return statement is incorrect. Consequently, we identify suspicious return statements with fine-tuned heuristic rules that we designed by analyzing a large corpus of source code.

Overview
An overview of the proposed approach is presented in Fig. 1. The proposed approach works as follows:
First, it extracts identifiers from the given source code statically.
Second, it computes the lexical similarities between identifiers.
Third, it applies a sequence of heuristic rules to match the signatures and return statements of given methods.
Finally, it identifies suspicious return statements and reports warnings.

Identifier Extraction
For a given declared method, we extract the following related identifier names by statically parsing the ASTs of Java files using the Eclipse JDT plug-in [16]:
(1) method name.
(3) candidate return statements. A candidate return statement refers to any expression that is accessible in the method declaration and can be used as the return statement without introducing any compilation errors.
The candidate return statements we extract include the following expressions:
(1) names of methods declared in the enclosing class.
(2) global variables that are accessible in the given method and compatible with the return type.
(3) local variables that are declared in the given method and compatible with the return type.
(4) formal parameters of the given method that are compatible with the return type.
(5) default values of the return type.
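The candidate collection above can be sketched on a simplified, hand-built model of a method's scope. This is only an illustration under stated assumptions, not the JDT-based implementation: the class `CandidateReturns`, its `Element` type, and the rule that "compatible" means exact type-name equality (applied to methods as well) are all hypothetical simplifications, and the field types in the usage example are assumed for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CandidateReturns {

    /** A named program element with its declared type (the return type for methods). */
    static final class Element {
        final String name;
        final String type;

        Element(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

    /** Default value literal for a Java type name: zero values for primitives, null otherwise. */
    static String defaultValue(String type) {
        switch (type) {
            case "boolean": return "false";
            case "char":    return "'\\0'";
            case "byte":
            case "short":
            case "int":     return "0";
            case "long":    return "0L";
            case "float":   return "0.0f";
            case "double":  return "0.0";
            default:        return "null";
        }
    }

    /**
     * Collects candidate return expressions in the order (1) methods, (2) fields,
     * (3) locals, (4) parameters, (5) default value.
     * Assumption: an element is compatible only if its type name equals the return type.
     */
    static List<String> collect(String returnType,
                                List<Element> methods,
                                List<Element> fields,
                                List<Element> locals,
                                List<Element> params) {
        List<String> candidates = new ArrayList<>();
        for (List<Element> scope : Arrays.asList(methods, fields, locals, params)) {
            for (Element e : scope) {
                if (e.type.equals(returnType)) {
                    candidates.add(e.name);
                }
            }
        }
        candidates.add(defaultValue(returnType));
        return candidates;
    }

    public static void main(String[] args) {
        List<Element> none = new ArrayList<>();
        List<Element> fields = Arrays.asList(
                new Element("closeNotifyTimeoutMillis", "long"),
                new Element("handshakeTimeoutMillis", "long"),
                new Element("engine", "SSLEngine")); // hypothetical non-matching field
        // Prints: [closeNotifyTimeoutMillis, handshakeTimeoutMillis, 0L]
        System.out.println(collect("long", none, fields, none, none));
    }
}
```

A real implementation would need the type checker's notion of assignment compatibility (subtyping, boxing, widening) rather than string equality.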

Lexical Similarity
We compute the lexical similarity between the method name and the other identifier names extracted for the given method using the Jaccard similarity metric [15]. Assuming that identifier names follow the camel case naming convention, we first split each identifier into a sequence of tokens, and then compute the lexical similarity based on the Jaccard similarity as follows:

$$ sim(mName, rName) = \frac{|tokens(mName) \cap tokens(rName)|}{|tokens(mName) \cup tokens(rName)|} $$

where mName is the name of a given method m, rName is the return statement or a candidate return value of the given method m, and tokens(mName) and tokens(rName) are the sequences of tokens split from the identifiers mName and rName.

The input of the algorithm includes the following information about the given method: the method name, the return statement, and all candidate return statements. If the lexical similarity between the method name and any candidate return statement exceeds the lexical similarity between the method name and the real return statement by more than a threshold, i.e., the return statement does not match the signature of the method, the proposed approach reports a suspicious-return warning by returning the value 1. If the return statement matches the signature best, the proposed approach returns 0, which means that the return value is considered correct.
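The similarity computation and the decision rule described above can be sketched as follows. This is a minimal illustration rather than the NBBugs implementation: the class and method names are hypothetical, tokens are lower-cased and treated as a set (an assumption), and the threshold value used in `main` is a placeholder because the source does not state the actual value.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class NameSimilarity {

    /** Splits a camel-case identifier into a set of lower-cased tokens. */
    static Set<String> tokens(String identifier) {
        Set<String> result = new HashSet<>();
        // Split at every position where a lower-case letter or digit is followed by an upper-case letter.
        for (String part : identifier.split("(?<=[a-z0-9])(?=[A-Z])")) {
            if (!part.isEmpty()) {
                result.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return result;
    }

    /** Jaccard similarity between the token sets of two identifiers. */
    static double jaccard(String mName, String rName) {
        Set<String> a = tokens(mName);
        Set<String> b = tokens(rName);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return (double) intersection.size() / union.size();
    }

    /**
     * Returns 1 if some candidate is more similar to the method name than the
     * actual return expression by more than the threshold, and 0 otherwise.
     */
    static int isSuspicious(String methodName, String actualReturn,
                            List<String> candidates, double threshold) {
        double actual = jaccard(methodName, actualReturn);
        for (String candidate : candidates) {
            if (jaccard(methodName, candidate) - actual > threshold) {
                return 1;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("closeNotifyTimeoutMillis", "handshakeTimeoutMillis");
        double threshold = 0.2; // placeholder value, not taken from the paper
        System.out.println(isSuspicious("getCloseNotifyTimeoutMillis",
                "handshakeTimeoutMillis", candidates, threshold));
    }
}
```

Treating tokens as a set ignores repeated tokens within an identifier; a multiset variant would be an equally plausible reading of "sequence of tokens".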

Research Questions
RQ1: Is the proposed approach effective in detecting incorrect return statements?
RQ3: How long does it take the proposed approach to identify incorrect return statements, i.e., is the proposed approach scalable?

Process
We evaluate the proposed approach on 100 open-source Java applications. We select these applications as the evaluation dataset for the following reasons. First, open-source code is publicly available, and thus other researchers can easily repeat our experiments. Second, Java is one of the most popular programming languages, ranking first in the TIOBE Index for 2020. We conduct the evaluation as follows. First, we select the commits related to fixing return statements by searching the commit history for commit messages that include "return bug", "fix return", "return error", "return statement", or "return value". Second, we manually check the selected commits to collect real incorrect return statements. Third, we apply the proposed approach to detect suspicious return statements and compare the reported warnings against the real return bugs.
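The keyword-based commit selection in the first step can be sketched as below. The keyword list is taken from the text; the class and method names are hypothetical, and case-insensitive substring matching is an assumption, since the source does not say how the messages were matched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class ReturnFixCommitFilter {

    /** Keywords listed in the evaluation process. */
    static final List<String> KEYWORDS = Arrays.asList(
            "return bug", "fix return", "return error", "return statement", "return value");

    /** True if the commit message contains any keyword (assumed case-insensitive). */
    static boolean mentionsReturnFix(String message) {
        String lower = message.toLowerCase(Locale.ROOT);
        for (String keyword : KEYWORDS) {
            if (lower.contains(keyword)) {
                return true;
            }
        }
        return false;
    }

    /** Keeps only the commit messages that are candidates for manual inspection. */
    static List<String> select(List<String> messages) {
        List<String> selected = new ArrayList<>();
        for (String message : messages) {
            if (mentionsReturnFix(message)) {
                selected.add(message);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Hypothetical commit messages, for illustration only.
        List<String> messages = Arrays.asList(
                "Fix return value of timeout getter",
                "Update README");
        System.out.println(select(messages));
    }
}
```

Such a filter only narrows the search; as the text notes, the selected commits still have to be checked manually to confirm real return bugs.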

Scalability
To evaluate the efficiency of the proposed approach, we measure the execution time it needs to detect bugs in the subject projects. The evaluation is conducted on a personal computer with the following configuration: Intel Core i9-9700, 16GB RAM, and Windows 10. We found that it takes around 30 seconds on average to execute the proposed approach on each project.

Examples of detected return defects
Fig. 2 presents two incorrect return statements detected by the proposed approach, which were found in the applications netty (commit id d9c700e9fed0ea964eeabc46809aeb76425c2a5f) and platform_frameworks_base (commit id b294eac086e7dba9c13cfb6cb4e24e39a1b88631). For example, the method getCloseNotifyTimeoutMillis should return the field closeNotifyTimeoutMillis; however, it mistakenly returned the field handshakeTimeoutMillis.

[Fig. 2: Example of return defects detected by the proposed approach. 1) Example from netty (commit d9c700e9fed0ea964eeabc46809aeb76425c2a5f). 2) Example from platform_frameworks_base (commit b294eac086e7dba9c13cfb6cb4e24e39a1b88631).]
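For the netty example, the Jaccard similarity from the Lexical Similarity section can be worked out by hand. The derivation below writes sim for that similarity and assumes that identifiers are split at camel-case boundaries and that tokens are lower-cased and compared as sets; the paper's exact tokenization and threshold may differ.

```latex
\begin{align*}
tokens(\text{getCloseNotifyTimeoutMillis}) &= \{get,\ close,\ notify,\ timeout,\ millis\} \\
tokens(\text{closeNotifyTimeoutMillis})    &= \{close,\ notify,\ timeout,\ millis\} \\
tokens(\text{handshakeTimeoutMillis})      &= \{handshake,\ timeout,\ millis\} \\[4pt]
sim(\text{getCloseNotifyTimeoutMillis},\ \text{closeNotifyTimeoutMillis})
  &= \frac{|\{close,\ notify,\ timeout,\ millis\}|}{|\{get,\ close,\ notify,\ timeout,\ millis\}|}
   = \frac{4}{5} = 0.8 \\
sim(\text{getCloseNotifyTimeoutMillis},\ \text{handshakeTimeoutMillis})
  &= \frac{|\{timeout,\ millis\}|}{|\{get,\ close,\ notify,\ timeout,\ millis,\ handshake\}|}
   = \frac{2}{6} \approx 0.33
\end{align*}
```

Under these assumptions the candidate closeNotifyTimeoutMillis is considerably more similar to the method name than the returned field handshakeTimeoutMillis (0.8 versus roughly 0.33), which is the kind of gap that would trigger a warning.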

Conclusions
In this paper, we propose the first name-based approach to identify suspicious return statements by leveraging heuristic rules. The rationale of the proposed approach is that semantically related code elements are often lexically similar. As a result, a significant mismatch between the return statement of a given method and the name of the method may suggest a return defect. Consequently, we design a series of fine-tuned heuristic rules to detect such defects. Evaluation on 100 open-source Java applications suggests that the proposed approach is effective in identifying suspicious return statements.