📊 Study on APIzations
This section contains all the data we used to conduct our investigatory study to understand how developers perform APIzations.
We collected examples of manual APIzations by mining both GitHub and StackOverflow. These examples represent such cases in which developers grabbed the snippets from StackOverflow and adapted to their own codebase, published in a public repository on GitHub. We only mined those examples on GitHub for which an explicit link to a StackOverflow is reported as part of the method documentation.
The insights gained from this study led to four common APIzation patterns that establish the foundations of our proposed technique.
Process
The following table shows the steps of our processing method to collect the data for the study.
Step | Description | Files | Questions | Answers | Methods | Snippets | Pairs | Data |
---|---|---|---|---|---|---|---|---|
1 | GitHub archive | ~1,000,000 |
– | – | – | – | – | – |
2 | Filter Java files with an explicit link to StackOverflow | 57,810 |
– | – | – | – | – | gh_files.jsonl.xz |
3 | Remove duplicates and links extraction | 29,035 |
11,300 |
4,008 |
– | – | – | gh_files_cleaned.jsonl.xz question_ids.csv answer_ids.csv |
4 | Extract answers text from StackOverflow | – | 10,991 |
64,678 |
– | – | – | so_answers.csv.xz so_answers_to_questions.csv.xz |
5 | Combine GitHub files and StackOverflow answers | 27,940 |
13,300 |
63,123 |
– | – | – | ghso_files_answers.jsonl.xz |
6 | Type 3 clone detection | 330 |
– | 199 |
330 |
199 |
330 |
sogh_pairs_clones.json.xz |
7 | Data preparation for manual evaluation | – | – | – | – | – | – | sogh_pairs_clones_files.tar.xz sogh_pairs_clones_diffs.tar.xz |
8 | Manual selection of APIs | – | – | – | 135 |
85 |
135 |
sogh_pairs_diffs_apizations.tar.xz sogh_pairs_diffs_different_semantic.tar.xz sogh_pairs_diffs_tests.tar.xz sogh_pairs_diffs_false_positives.tar.xz |
9 | Coding patterns | – | – | – | – | – | 135 |
parameters_patterns_coding.csv return_patterns_coding.csv |
10 | Manual application of the patterns | – | – | – | – | – | 135 |
parameters_patterns_analysis.csv return_patterns_analysis.csv |
1. GitHub archive
We started from the GitHub archive on Google BigQuery with the dump of May, 2019
.
2. Filter Java files with an explicit link to StackOverflow
We queried BigQuery and extracted those Java files having an explicit link to a StackOverflow page, %stackoverflow.com%
.
In particular, we queried the bigquery-public-data.github_repos.contents
table.
We produced the file gh_files.jsonl.xz
.
3. Remove duplicates and links extraction
We cleaned the data from duplicates and recognized the IDs of StackOverflow questions and answers.
We processed gh_files.jsonl.xz
through a Python script that saves the filtered GitHub files into gh_files_cleaned.jsonl.xz
.
We also created the question_ids.csv
and answer_ids.csv
files, which contain the IDs of the linked questions and answers, respectively.
In the case of answer_ids.csv
, we found the associated question IDs in the next passages, being an answer ID unique regardless of the question.
It might also happen that a single file has multiple StackOverflow links.
4. Extract answers text from StackOverflow
We uploaded answer_ids.csv
and question_ids.csv
into BigQuery.
Then, we ran a query on bigquery-public-data.stackoverflow.posts_answers
to retrieve the answer posts on the May, 2019
dump of StackOverflow.
The answers IDs correspond to those links in which an answer post was explicitly declared.
For this, we used the IDs belonging to answer_ids.csv
.
We saved the results into a CSV file so_answers.csv.xz
.
Instead, we ran another query to collect all the answer posts related to the questions we extracted from links, using the IDs in question_ids.csv
.
We saved the results into a CSV file so_answers_to_questions.csv.xz
.
Until this moment, we have the information on all the involved questions and possible answers.
5. Combine GitHub files and StackOverflow answers
We combined gh_files_cleaned.jsonl.xz
containing the Java files, with the answers contained in so_answers.csv.xz
and so_answers_to_questions.csv.xz
, using a Python script.
We also removed all the files for which we could not obtain all the required information.
We save the results into the file ghso_files_answers.jsonl.xz
, which contains for every GitHub file all the possible answers.
6. Type 3 clone detection
We processed ghso_files_answers.jsonl.xz
to extract public
methods from files, only those reporting an explicit link to StackOverflow in the documentation/comment.
We converted all the StackOverflow possible answer texts into code snippets.
For each links between a method and possible snippets, we created the pairs of (snippet, method)
.
We also filtered out those snippets with less lines of code than 6
.
Finally, we detected the type 3 code clones between snippet and method with a 70%
matching threshold.
We saved the results in the sogh_pairs_clones.json.xz
file.
7. Data preparation for manual evaluation
We prepared the data for manual evaluation.
We parsed sogh_pairs_clones.json.xz
to prepare a folder for every pair so#{so_answer_id}_gh#{gh_file_id}
, containing:
- the
so#{so_answer_id}.java
snippet from StackOverflow - the
gh#{gh_file_id}.java
matching method code from GitHub
We saved all the files into a folder, compressed into the file sogh_pairs_clones_files.tar.xz
.
Then, we created the HTML files, visually showing the differences between the snippet and method, using the diff and diff2html utilities.
We saved all the files into a compressed file sogh_pairs_clones_diffs.tar.xz
.
8. Manual selection of APIs
We pruned the pairs due to spurious code clones, those snippets that were included into a larger GitHub method. We manually evaluated and classified all the pairs:
sogh_pairs_diffs_apizations.tar.xz
, the final set of pairs used for the analysis, as examples of valid APIzationssogh_pairs_diffs_different_semantic.tar.xz
, the pairs where the StackOverflow snippets were included into more complex methodssogh_pairs_diffs_tests.tar.xz
, not valid pairs because they are test casessogh_pairs_diffs_false_positives.tar.xz
, not valid examples of reuse
9. Coding patterns
We manually analyzed the content of the APIzations to identify possible patterns. We followed a coding process inspired by grounded theory. The list of codes is described in the following table.
Type | Code |
---|---|
Parameter | Undeclared variable |
Parameter | The variable has a constant value |
Return | The variable is the latest statement in the code snippet |
Return | The variable is used as an argument in a System.out.println invocation |
At the end of the coding process, we were able to extract the common patterns then used in our automated approach called APIzator.
We provide the results of such manual analysis in the files parameters_patterns_coding.csv
and return_patterns_coding.csv
, for the parameter and return statements transformations, respectively.
In the provided files, we specify the agreed pattern in the column code
.
10. Manual application of the patterns
As the final step in our study, we tried to manually apply these patterns to the same examples used for observations.
This is intended to simulate how our automated approach called APIzator would behave.
We provide the results of such manual analysis in the files parameters_patterns_analysis.csv
and return_patterns_analysis.csv
, for the parameter and return statements transformations, respectively.
In the provided file, we specified the applied pattern in the column human_classification
.
We indicated the pattern that would be applied by our approach in the column apizator_classification
.
The pattern classification is described in the following table.
Classification | Type | Description |
---|---|---|
none | Parameter | The developer did not convert the variable to a parameter. |
custom | Parameter | The developer applied a transformation by using an arbitrary action we were not able to generalize. |
PATT-notdecl | Parameter | The developer created a parameter from a variable that is only used, but not declared, in the code snippet. |
PATT-const | Parameter | The developer transformed a variable with a hard-coded assignment to a parameter. |
none | Return | There are no return statements in the resulting API. |
already | Return | The return statement is already declared in the snippet and was not changed by the developer. |
custom | Return | The developer indicated a return value by using an arbitrary action we were not able to generalize. |
PATT-latest | Return | The developer derived the return statement as the last assignment to a variable in the snippet. |
PATT-syso | Return | The developer transformed a print to the system output to the return statement. |