Code-Code line completion results on corrected/completed 10,000 sample test JSON are incorrect by one or two orders of magnitude

I was excited to find a benchmark for experimenting with RAG poisoning, using ideas from the paper "AGENTPOISON: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases".  I also read the ReACC paper, "ReACC: A Retrieval-Augmented Code Completion Framework", by Microsoft researchers.  For RAG poisoning to be considered successful, the poisoned system should perform about as well as the clean one on samples that are not "triggers".  I therefore decided to use the CodeXGLUE benchmark to measure my RAG pipeline on line-level code completion, both before and after the poisoning.  I am working with Python.

Minor "code rot" issues aside, I was able to get CodeXGLUE working.  For the Code-Code line completion task, the advertised results are in the tables listed here:  https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/CodeCompletion-line.  On Py150, CodeGPT is listed with an EM of 42.18 and an edit similarity of 71.23.  Results in the same ballpark are listed for CodeGPT-adapted.

As the posters of issues 102 and 130 noted, there is no ground truth in the test.json file for the Code-Code line completion task.  However, test.json does list 10,000 "cut" input programs, along with IDs into the source list found in the file python50k_eval.txt.  The correspondence between the "id" in test.json and the corresponding file in python50k_eval.txt is easily verified by inspection.  The ground truth is therefore easy to reconstruct: by counting the number of "<EOL>" tokens in each test.json entry, we can determine the "cut point" in the source file.

We have both the file from which each sample in test.json was "cut" and the exact "cut point" in that file.  The line immediately after the cut point is the ground truth for the line completion task.  I am only interested in a single line of completion, which is similar to what is demonstrated here:  https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/CodeCompletion-line.  I reused the provided "preprocess.py", found at "CodeCompletion-token/dataset/py150/preprocess.py", and modified it accordingly.
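The reconstruction described above can be sketched roughly as follows.  This is a minimal illustration, not the modified preprocess.py itself: the function names are hypothetical, and it assumes both the cut input and the full source are token strings in which lines are separated by "<EOL>" and optionally wrapped in "<s>" / "</s>".  It also tolerates a cut that ends partway through a line (an assumption about the data, not something I am claiming about test.json), in which case it returns only the remaining tokens of that line.

```python
from __future__ import annotations

EOL = "<EOL>"


def split_lines(tokenized: str) -> list[str]:
    """Split an <EOL>-delimited token string into whitespace-normalized lines."""
    body = tokenized.replace("<s>", " ").replace("</s>", " ")
    return [" ".join(part.split()) for part in body.split(EOL)]


def reconstruct_ground_truth(cut_input: str, full_source: str) -> str:
    """Return the completion target for cut_input, taken from full_source.

    The number of <EOL> tokens in cut_input gives the number of complete
    lines before the cut, i.e. the index of the target line in full_source.
    """
    n_complete = cut_input.count(EOL)
    cut_lines = split_lines(cut_input)
    src_lines = split_lines(full_source)

    if n_complete >= len(src_lines):
        raise ValueError("cut point lies beyond the end of the source file")
    # Sanity check: the context before the cut must match the source exactly.
    if cut_lines[:n_complete] != src_lines[:n_complete]:
        raise ValueError("cut input is not a prefix of the source file")

    partial = cut_lines[n_complete].split()  # empty if the cut ends on <EOL>
    target = src_lines[n_complete].split()
    if target[: len(partial)] != partial:
        raise ValueError("partial last line does not match the source line")
    return " ".join(target[len(partial):])


if __name__ == "__main__":
    full = "<s> import os <EOL> x = 1 <EOL> print ( x ) <EOL> </s>"
    cut = "<s> import os <EOL> x = 1 <EOL>"
    print(reconstruct_ground_truth(cut, full))  # print ( x )
```

The prefix check is the part I would not skip: it fails loudly if an "id" points at the wrong file or if the tokenization of the two inputs differs, either of which would silently produce wrong ground truth.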

My issue arises when I perform the inference step described here:  https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/CodeCompletion-line.  Using a fine-tuned CodeGPT-adapted model, I get the following output:

Edit sim: 1.4198, EM: 0.0044

NOTE that these values are multiple orders of magnitude off from the advertised results.  I have high confidence in the fine-tuned model, as well as in the test.json reconstruction (the process is fairly simple, and I manually inspected many examples).  NOTE also that I am using your provided run_lm.py for evaluation, with parameters the same as or similar to those listed on CodeXGLUE.
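For an independent cross-check of the two metrics, something like the sketch below can be run over the predictions and the reconstructed ground truth.  It is not the scoring code in run_lm.py: the function names are hypothetical, and the edit similarity here uses Python's standard difflib as a stand-in, so small differences from the official numbers are to be expected.  Both values are reported on a 0 to 100 scale, which makes it easier to rule out a fraction-versus-percentage mismatch when comparing against the published table.

```python
from difflib import SequenceMatcher


def normalize(line: str) -> str:
    """Collapse runs of whitespace so only token content is compared."""
    return " ".join(line.split())


def exact_match(pred: str, gt: str) -> bool:
    """True when prediction and ground truth have identical token sequences."""
    return pred.split() == gt.split()


def edit_similarity(pred: str, gt: str) -> float:
    """Character-level similarity in [0, 100] (difflib stand-in, see above)."""
    return 100.0 * SequenceMatcher(None, normalize(pred), normalize(gt)).ratio()


def score(preds: list, gts: list) -> dict:
    """Average EM and edit similarity over paired lines, both on a 0-100 scale."""
    if len(preds) != len(gts):
        raise ValueError("predictions and ground truth differ in length")
    if not preds:
        raise ValueError("nothing to score")
    n = len(preds)
    em = sum(exact_match(p, g) for p, g in zip(preds, gts))
    es = sum(edit_similarity(p, g) for p, g in zip(preds, gts))
    return {"EM": 100.0 * em / n, "edit_sim": es / n}


if __name__ == "__main__":
    print(score(["x = 1", "foo"], ["x = 1", "bar"]))
```

The length check matters in practice: if the prediction file and the ground-truth file are off by one line, every pair is misaligned and both metrics collapse toward zero.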

To summarize, the multiple-orders-of-magnitude disparity between my results and the advertised results on CodeXGLUE is concerning.  I would be happy to provide the CodeXGLUE maintainers with 1) the reconstructed test.json, 2) the run_lm.py inference configuration and runtime parameters, and 3) the fine-tuned CodeGPT-adapted model.  Regarding ground truth, your advice to the posters of issues 102 and 130 was this:

"As a benchmark dataset, we don't release the ground truth in some tasks. You can participant our benchmark leaderboard by sending submissions to codexglue@microsoft.com."

It is difficult for anyone to verify what is published in the inference results tables at the CodeXGLUE Website (https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/CodeCompletion-line).  However, by simple reverse engineering of the test.json file, these numbers should be verifiable.

The careful attempt described above failed to reproduce these results; moreover, the results were orders of magnitude worse.

I ask that this issue be pursued further and that the posted results at least be verified.  I am willing to share all artifacts I used in producing these results.  As a side note, I find, intuitively, that the advertised EM of 42.37, for example, for Py150 is suspiciously high.  I note that the preprocessing of Python code, described in the CodeCompletion section of CodeXGLUE, completely discards the semantic value of indentation.  This is mentioned and discussed briefly in issue #75.  In that post, one of the maintainers states the following:

"we know that it may convey useful information but in our experiments, we found that we don't need preserve them in purpose as the pre-trained model can easily learn them from codes."

This statement surprises me, in that even skilled Python programmers would be challenged to reconstruct a Python program in which every run of consecutive whitespace has been replaced by a single space.  My intuition is that much semantic information is lost, and that the LLM would be better off with this information (simply encode, for example, <INDENT> and <DEDENT> tokens, as produced by the tokenizer).
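To make the suggestion concrete, here is a minimal sketch of such an encoding built on Python's standard tokenize module, which already emits INDENT and DEDENT tokens.  The function name and the choice to drop comments and blank lines are my own assumptions for illustration; this is not how the CodeXGLUE preprocessing works, and it leaves literals untouched.

```python
import io
import tokenize


def encode_with_indent_tokens(source: str) -> str:
    """Flatten Python source to one token string, keeping block structure.

    Indentation changes become <INDENT>/<DEDENT> and logical line ends
    become <EOL>; comments and blank lines are dropped.
    """
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.INDENT:
            out.append("<INDENT>")
        elif tok.type == tokenize.DEDENT:
            out.append("<DEDENT>")
        elif tok.type == tokenize.NEWLINE:
            out.append("<EOL>")
        elif tok.type in (tokenize.NL, tokenize.COMMENT, tokenize.ENDMARKER):
            continue  # carries no structure needed for this encoding
        else:
            out.append(tok.string)
    return " ".join(out)


if __name__ == "__main__":
    src = "def f(x):\n    if x:\n        return 1\n    return 0\n"
    print(encode_with_indent_tokens(src))
```

For the small function above, this prints `def f ( x ) : <EOL> <INDENT> if x : <EOL> <INDENT> return 1 <EOL> <DEDENT> return 0 <EOL> <DEDENT>`, from which the original block structure can be recovered unambiguously.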

In short, given the simple reverse engineering of the "ground truth" for test.json, can you help verify the advertised inference results posted in the table here:  https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/CodeCompletion-line?  Again, I'm happy to provide my test.json, inference call parameters, and model.  NOTE: First, I'm not asking you to reveal your benchmark dataset; I have already reverse engineered one.  Second, I'm not asking for a "black box" solution: sending a submission to codexglue@microsoft.com provides no proof of verifiability.  What I'm asking the maintainers is this: given my test.json, with valid ground truth, why aren't the inference results in the ballpark of the published results?

I thank you in advance for your consideration and time with this issue.
