📊 Study on APIzations

This section contains all the data we used to conduct our investigatory study to understand how developers perform APIzations.

We collected examples of manual APIzations by mining both GitHub and StackOverflow. These examples are cases in which developers took a snippet from StackOverflow and adapted it to their own codebase, published in a public repository on GitHub. We only mined those GitHub examples in which the method documentation contains an explicit link to a StackOverflow page.

The insights gained from this study led to four common APIzation patterns that establish the foundations of our proposed technique.

Process

The following table shows the steps of our processing method to collect the data for the study.

Step	Description	Files	Questions	Answers	Methods	Snippets	Pairs	Data
1	GitHub archive	`~1,000,000`	–	–	–	–	–	–
2	Filter Java files with an explicit link to StackOverflow	`57,810`	–	–	–	–	–	`gh_files.jsonl.xz`
3	Remove duplicates and links extraction	`29,035`	`11,300`	`4,008`	–	–	–	`gh_files_cleaned.jsonl.xz` `question_ids.csv` `answer_ids.csv`
4	Extract answers text from StackOverflow	–	`10,991`	`64,678`	–	–	–	`so_answers.csv.xz` `so_answers_to_questions.csv.xz`
5	Combine GitHub files and StackOverflow answers	`27,940`	`13,300`	`63,123`	–	–	–	`ghso_files_answers.jsonl.xz`
6	Type 3 clone detection	`330`	–	`199`	`330`	`199`	`330`	`sogh_pairs_clones.json.xz`
7	Data preparation for manual evaluation	–	–	–	–	–	–	`sogh_pairs_clones_files.tar.xz` `sogh_pairs_clones_diffs.tar.xz`
8	Manual selection of APIs	–	–	–	`135`	`85`	`135`	`sogh_pairs_diffs_apizations.tar.xz` `sogh_pairs_diffs_different_semantic.tar.xz` `sogh_pairs_diffs_tests.tar.xz` `sogh_pairs_diffs_false_positives.tar.xz`
9	Coding patterns	–	–	–	–	–	`135`	`parameters_patterns_coding.csv` `return_patterns_coding.csv`
10	Manual application of the patterns	–	–	–	–	–	`135`	`parameters_patterns_analysis.csv` `return_patterns_analysis.csv`

1. GitHub archive

We started from the GitHub archive on Google BigQuery, using the May 2019 dump.

2. Filter Java files with an explicit link to StackOverflow

We queried BigQuery and extracted the Java files that contain an explicit link to a StackOverflow page, that is, files whose content matches the pattern %stackoverflow.com%. In particular, we queried the bigquery-public-data.github_repos.contents table. We produced the file gh_files.jsonl.xz.

3. Remove duplicates and links extraction

We removed duplicates from the data and extracted the IDs of the linked StackOverflow questions and answers. We processed gh_files.jsonl.xz with a Python script that saves the filtered GitHub files into gh_files_cleaned.jsonl.xz. We also created the question_ids.csv and answer_ids.csv files, which contain the IDs of the linked questions and answers, respectively. Because an answer ID is unique regardless of its question, answer_ids.csv stores only answer IDs; we retrieved the associated question IDs in the following steps. Note that a single file may contain multiple StackOverflow links.
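The link extraction can be illustrated with a minimal sketch. It is not the script used in the study: the regular expressions and the function name `extract_ids` are assumptions, and they cover only the most common StackOverflow URL shapes (`/questions/<id>/...`, `/q/<id>`, `/a/<id>`). A link that points to a specific answer contributes only an answer ID, in line with the description above.

```python
import re

# Assumed URL shapes; the original script may recognize more variants.
ANSWER_SHORT = re.compile(r"stackoverflow\.com/a/(\d+)")
QUESTION = re.compile(
    r"stackoverflow\.com/(?:questions|q)/(\d+)"  # question ID
    r"(?:/[^\s/#]*)?"                            # optional title slug (or user ID)
    r"(?:/(\d+))?"                               # optional answer ID in the path
    r"(?:#(\d+))?"                               # optional answer ID in the anchor
)

def extract_ids(text):
    """Return (question_ids, answer_ids) found in the text of a source file."""
    questions, answers = set(), set()
    for m in ANSWER_SHORT.finditer(text):
        answers.add(int(m.group(1)))
    for m in QUESTION.finditer(text):
        qid, path_aid, anchor_aid = m.groups()
        aid = path_aid or anchor_aid
        if aid:
            # The link targets one answer: keep only the answer ID.
            answers.add(int(aid))
        else:
            questions.add(int(qid))
    return questions, answers
```

Returning sets means that a file with several links to the same post yields each ID once.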

4. Extract answers text from StackOverflow

We uploaded answer_ids.csv and question_ids.csv into BigQuery. Then, we ran a query on bigquery-public-data.stackoverflow.posts_answers to retrieve the answer posts from the May 2019 dump of StackOverflow. The first query used the IDs in answer_ids.csv, which correspond to the links that explicitly point to an answer post. We saved the results into the CSV file so_answers.csv.xz. A second query used the IDs in question_ids.csv to collect all the answer posts related to the questions we extracted from the links. We saved the results into the CSV file so_answers_to_questions.csv.xz. At this point, we had the information on all the involved questions and their possible answers.

5. Combine GitHub files and StackOverflow answers

We combined gh_files_cleaned.jsonl.xz, which contains the Java files, with the answers contained in so_answers.csv.xz and so_answers_to_questions.csv.xz, using a Python script. We also removed all the files for which we could not obtain all the required information. We saved the results into the file ghso_files_answers.jsonl.xz, which contains, for every GitHub file, all the possible answers.
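As an illustration of this join, the sketch below attaches candidate answers to each file. The record layout (`answer_ids`, `question_ids`, `id`, `body`) and the function name `combine` are hypothetical, and "could not obtain all the required information" is interpreted here as "no candidate answer was found", which is an assumption.

```python
def combine(files, answers_by_id, answers_by_question):
    """Attach every candidate answer to each GitHub file; drop files with none.

    files: list of dicts with hypothetical keys "answer_ids" and "question_ids".
    answers_by_id: answer ID -> answer record (direct answer links).
    answers_by_question: question ID -> list of answer records.
    """
    combined = []
    for f in files:
        candidates = {}
        # Links that point directly to an answer.
        for aid in f["answer_ids"]:
            if aid in answers_by_id:
                candidates[aid] = answers_by_id[aid]
        # Links that point to a question: every answer is a candidate.
        for qid in f["question_ids"]:
            for ans in answers_by_question.get(qid, []):
                candidates[ans["id"]] = ans
        if candidates:
            combined.append({**f, "answers": list(candidates.values())})
    return combined
```

Keying the candidates by answer ID avoids listing the same answer twice when a file links both to a question and to one of its answers.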

6. Type 3 clone detection

We processed ghso_files_answers.jsonl.xz to extract, from each file, only the public methods whose documentation or comments contain an explicit link to StackOverflow.

We converted the text of all the possible StackOverflow answers into code snippets. For each link between a method and its possible snippets, we created the (snippet, method) pairs. We also filtered out the snippets with fewer than 6 lines of code.
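The conversion and the length filter can be sketched as follows. The sketch assumes that the answer body is HTML in which code appears inside `<pre><code>` blocks, and that only non-blank lines count towards the 6-line threshold; both points, as well as the function name `extract_snippets`, are assumptions rather than a description of the original script.

```python
import html
import re

# Code blocks in an answer body (assumed HTML layout).
CODE_BLOCK = re.compile(r"<pre[^>]*>\s*<code[^>]*>(.*?)</code>\s*</pre>", re.DOTALL)
MIN_LINES = 6  # threshold stated in the text

def extract_snippets(answer_html, min_lines=MIN_LINES):
    """Return the code blocks of an answer that have at least min_lines lines."""
    snippets = []
    for m in CODE_BLOCK.finditer(answer_html):
        # Decode entities such as &gt; and trim surrounding newlines.
        code = html.unescape(m.group(1)).strip("\n")
        loc = sum(1 for line in code.splitlines() if line.strip())
        if loc >= min_lines:
            snippets.append(code)
    return snippets
```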

Finally, we detected the type 3 code clones between snippet and method with a 70% matching threshold. We saved the results in the sogh_pairs_clones.json.xz file.
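To give an intuition for the 70% threshold, the sketch below uses a simple stand-in for a type 3 clone detector: it measures the fraction of snippet lines that also appear, in order, in the method, after whitespace normalization. This is not the detector used in the study, and real type 3 detectors tolerate more changes (for example, renamed identifiers). The function names and the exact definition of "matching" are assumptions.

```python
from difflib import SequenceMatcher

def normalize(code):
    """Collapse whitespace and drop blank lines so formatting does not count."""
    return [" ".join(line.split()) for line in code.splitlines() if line.strip()]

def snippet_coverage(snippet, method):
    """Fraction of snippet lines that also appear, in order, in the method."""
    s, m = normalize(snippet), normalize(method)
    if not s:
        return 0.0
    matcher = SequenceMatcher(None, s, m, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(s)

def is_clone(snippet, method, threshold=0.7):
    """Accept the pair when the coverage reaches the threshold."""
    return snippet_coverage(snippet, method) >= threshold
```

With this measure, lines that the developer added to the method (a signature, a `return` statement) do not lower the score; only snippet lines that were changed or removed do.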

7. Data preparation for manual evaluation

We prepared the data for manual evaluation. We parsed sogh_pairs_clones.json.xz to prepare a folder for every pair so#{so_answer_id}_gh#{gh_file_id}, containing:

the so#{so_answer_id}.java snippet from StackOverflow
the gh#{gh_file_id}.java matching method code from GitHub

We saved all the files into a folder, compressed into the file sogh_pairs_clones_files.tar.xz.

Then, we created the HTML files, visually showing the differences between the snippet and method, using the diff and diff2html utilities.

We saved all the files into a compressed file sogh_pairs_clones_diffs.tar.xz.
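The preparation of one pair can be sketched with the Python standard library. The study used the diff and diff2html utilities; `difflib.HtmlDiff` is swapped in here only to keep the example self-contained. The sketch reads `#{...}` in the naming scheme as a placeholder for the ID, and the function names are hypothetical.

```python
import difflib
from pathlib import Path

def write_pair(out_dir, so_answer_id, gh_file_id, snippet, method):
    """Create the folder of one pair with the snippet and the method code."""
    pair_dir = Path(out_dir) / f"so{so_answer_id}_gh{gh_file_id}"
    pair_dir.mkdir(parents=True, exist_ok=True)
    (pair_dir / f"so{so_answer_id}.java").write_text(snippet, encoding="utf-8")
    (pair_dir / f"gh{gh_file_id}.java").write_text(method, encoding="utf-8")
    return pair_dir

def render_diff(snippet, method, so_name, gh_name):
    """Return an HTML page showing the snippet and the method side by side."""
    return difflib.HtmlDiff().make_file(
        snippet.splitlines(), method.splitlines(), so_name, gh_name
    )
```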

8. Manual selection of APIs

We pruned the pairs that were spurious code clones, such as snippets that were merely included in a larger GitHub method. We manually evaluated and classified all the pairs:

sogh_pairs_diffs_apizations.tar.xz, the final set of pairs used for the analysis, as examples of valid APIzations
sogh_pairs_diffs_different_semantic.tar.xz, the pairs where the StackOverflow snippets were included into more complex methods
sogh_pairs_diffs_tests.tar.xz, not valid pairs because they are test cases
sogh_pairs_diffs_false_positives.tar.xz, not valid examples of reuse

9. Coding patterns

We manually analyzed the content of the APIzations to identify possible patterns. We followed a coding process inspired by grounded theory. The list of codes is described in the following table.

Type	Code
Parameter	Undeclared variable
Parameter	The variable has a constant value
Return	The variable is the latest statement in the code snippet
Return	The variable is used as an argument in a `System.out.println` invocation

At the end of the coding process, we were able to extract the common patterns that we then used in our automated approach, called APIzator. We provide the results of this manual analysis in the files parameters_patterns_coding.csv and return_patterns_coding.csv, for the parameter and return statement transformations, respectively. In the provided files, we specify the agreed pattern in the column code.

10. Manual application of the patterns

As the final step in our study, we tried to manually apply these patterns to the same examples used for observations. This is intended to simulate how our automated approach called APIzator would behave. We provide the results of such manual analysis in the files parameters_patterns_analysis.csv and return_patterns_analysis.csv, for the parameter and return statements transformations, respectively.

In the provided files, we specify the pattern applied by the developer in the column human_classification and the pattern that our approach would apply in the column apizator_classification.

The pattern classification is described in the following table.

Classification	Type	Description
none	Parameter	The developer did not convert the variable to a parameter.
custom	Parameter	The developer applied a transformation by using an arbitrary action we were not able to generalize.
PATT-notdecl	Parameter	The developer created a parameter from a variable that is only used, but not declared, in the code snippet.
PATT-const	Parameter	The developer transformed a variable with a hard-coded assignment to a parameter.
none	Return	There are no return statements in the resulting API.
already	Return	The return statement is already declared in the snippet and was not changed by the developer.
custom	Return	The developer indicated a return value by using an arbitrary action we were not able to generalize.
PATT-latest	Return	The developer derived the return statement as the last assignment to a variable in the snippet.
PATT-syso	Return	The developer transformed a print to the system output to the return statement.
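The two classification columns can be compared directly. The sketch below computes the fraction of rows in which human_classification and apizator_classification coincide. The sample rows and the `pair` column are made up for illustration and do not come from the provided files; only the two classification column names are taken from the description above.

```python
import csv
import io

def agreement(rows):
    """Fraction of rows where the human and APIzator classifications match."""
    rows = list(rows)
    if not rows:
        return 0.0
    same = sum(
        1 for r in rows
        if r["human_classification"] == r["apizator_classification"]
    )
    return same / len(rows)

# Made-up sample; replace with open("parameters_patterns_analysis.csv").
sample = io.StringIO(
    "pair,human_classification,apizator_classification\n"
    "so1_gh1,PATT-const,PATT-const\n"
    "so2_gh2,custom,PATT-notdecl\n"
    "so3_gh3,none,none\n"
    "so4_gh4,PATT-notdecl,none\n"
)
print(agreement(csv.DictReader(sample)))
```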