Analysis Methods

The purpose of this investigation is to determine whether it is possible to measure the quality of a story in an unbiased and reproducible way.

Rubric analysis is a well-known evaluation method for large bodies of text. You define separate categories, themes, or topics on which to evaluate the text (the so-called rubric axes) and score the text on each of them. To guide the scoring process, each quantitative score is accompanied by a description of what earns that score (the so-called descriptors). If you are interested, you can find more about rubric scoring methods here:

The scope of our analysis is serious Hermione/Viktor Harry Potter fanfiction (so-called vikmiones), meaning:

The story is a Harry Potter Fanfiction.
One of the main focus points in the story is the Hermione/Viktor relationship.
Hermione/Viktor is one of the endgame pairings.
The story contains meaningful narrative content (explicit scenes are not the main purpose of the text).

You can find the rubric that was used for this narrative scope here. The rubric has been tuned to the scope, but its criteria are deliberately worded to judge only the quality of execution of the work, not to punish creative choices on plot, characterization, worldbuilding, etc. As such, the rubric aims to be discrimination-free within the scope.

Next, a suitable set of vikmiones has to be selected, and the rubric has to be applied to each of them. This is done with generative AI (we used ChatGPT, with GPT-5.3). There is no guarantee that an AI reader can judge a story better than a human reader. AI evaluation was chosen for its consistency and reproducibility, rather than for superior interpretative ability. Finally, the results were interpreted using statistical analysis. The details of each step are described below.

Dataset selection

Vikmione stories were selected from AO3, one of the largest fanfiction-hosting sites in the world. It also has a detailed tag system for relationships, characters, and other story elements, which allows us to sweep the site content easily and effectively, and a standardised PDF export. The PDF export is an important feature, as AI readers are heavily influenced by the format and structure of the documents they are given. By using a standardised PDF export for each investigated story, this type of bias can be effectively eliminated from the analysis.

You can use the AO3 tag system to search for vikmione stories here. In this analysis, we did not consider vikmiones from any source other than AO3 (because those document formats are different). We sampled the site on the 9th of April 2026.

We used the following criteria to reduce the set of all vikmiones on AO3 to a workable scope of 'serious' stories:

Select Harry Potter - J. K. Rowling as the only fandom, and exclude crossovers.
Select Language: English. This is the largest of the available languages, and we wish to eliminate biases in the analysis due to different languages.
Select Wordcount: >=50k. The threshold of 50k words is fairly arbitrary, but the purpose of this filter is to distinguish one-shots and short stories from longer ones.
Exclude other Hermione relationships, for example:
- Hermione Granger/Draco Malfoy
- Hermione Granger/Ron Weasley
- Hermione Granger/Harry Potter
- Hermione Granger/Severus Snape
- Hermione Granger/Charlie Weasley
- Hermione Granger/Fred Weasley
- Hermione Granger/George Weasley

On the 9th of April 2026, the total collection of Hermione/Viktor stories was 1716 works. The above selection steps reduced this to 122 works. The exclusion of other Hermione pairings follows from the definition of the vikmione scope above: usually, when Hermione/Viktor is combined in a story with another Hermione pairing, the Hermione/Viktor pairing is not endgame. This is not a hard rule, but a very strong pattern. Hence, together with the wordcount filter (longer fics with meaningful narrative content), the crossover filter (our scope is pure Harry Potter) and the language filter, this is a reasonable attempt to identify the scope of the analysis.
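The first selection phase can be sketched as a predicate over work metadata. This is only an illustration: the filters were applied through the AO3 search form, and the `Work` record, its field names, and the exact pairing tag string below are hypothetical. The sketch also assumes that every relationship tag pairing Hermione with someone other than Viktor is excluded, whereas the list above only gives examples.

```python
from dataclasses import dataclass, field

# Hypothetical record layout; AO3 applies these filters through its search form.
@dataclass
class Work:
    title: str
    fandoms: list
    language: str
    word_count: int
    relationships: list = field(default_factory=list)

TARGET_FANDOM = "Harry Potter - J. K. Rowling"
TARGET_PAIRING = "Hermione Granger/Viktor Krum"  # assumed tag spelling
MIN_WORDS = 50_000

def other_hermione_pairing(tag):
    """True for a romantic ('/') tag pairing Hermione with someone other than Viktor."""
    return tag != TARGET_PAIRING and "Hermione Granger" in tag.split("/")

def passes_first_filter(work):
    """Apply the fandom, language, wordcount and pairing filters described above."""
    return (
        work.fandoms == [TARGET_FANDOM]      # single fandom, so no crossovers
        and work.language == "English"
        and work.word_count >= MIN_WORDS
        and TARGET_PAIRING in work.relationships
        and not any(other_hermione_pairing(t) for t in work.relationships)
    )
```

Stories that pass this predicate would still need the manual review of tags and summaries; the sketch covers only the automated part.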

The second selection phase is to manually read the tags and summaries of all 122 stories that pass the above filter. This was done to verify whether a story's narrative purpose was indeed a serious vikmione (as defined above), or whether the story aimed at explicit content and/or romantic triangles involving Viktor Krum, or had a totally different scope.

This reduced the dataset further from 122 stories to 30 stories. The list of these stories can be found here. These are the stories that comprise the full dataset of our analysis. Note that some stories in this dataset do not pass the wordcount filter. However, they were included because it was known that they match the scope. As such, the above criteria should not be viewed as absolute or definitive, but as a first attempt to identify a suitable dataset on AO3 matching our scope.

Hence, if there are any other stories that you feel should be included in this analysis (because they match the above scope), you are welcome to contribute. You can create a pull request here and suggest additional stories. Provide a clear argument as to why the story belongs in the scope, and note that the story must be available on AO3 (this is a hard requirement), because biases from different document formats cannot be accepted.

Evaluation Pipeline

To apply the rubric to the Vikmione stories in our dataset, the following procedure was used:

Choose a standard AI tool and LLM for the evaluation (we used ChatGPT with GPT-5.3).
Choose a standard document format for all stories, to counteract AI sensitivity to document format (for example, the AO3 PDF export).
Switch off memory functions and cross-chat functions.
Open a fresh, empty chat.
Load a single rubric axis into the first prompt; terminate the model response immediately after the rubric axis has been acknowledged.
Into the second prompt, load as many PDF documents as permitted by the model's features (in our case the limit was 20).
Add the following prompt text:

Score all supplied PDFs against the above evaluation model. Briefly argue each score using the source material. Rely solely on the descriptors for calibration of the scoring scale. Do not compare between the stories at all. We want independent and non-relative scores. NB: use only the narrative body text. Do NOT use or infer information from: Tags, Author Notes, Summary, Chapter Titles, or other metadata of any kind. If such elements are present in the text, ignore them entirely.

Repeat the process for all stories under consideration, and for all 10 rubric axes. NB: Use a new chat every time, at least when switching rubric axis!
Aggregate the scores for each axis manually into a total score (see the rubric for the exact rules).
The first outcome of the LLM is final; do not supply any additional questions or context to steer the evaluation, as this biases the scoring between stories.
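To illustrate how the procedure scales, the sketch below plans the chat sessions needed for one evaluation run: one fresh chat per combination of rubric axis and batch of at most 20 PDFs. The function names and the axis and file labels are hypothetical; the evaluation itself was carried out manually in ChatGPT, not through code.

```python
MAX_FILES_PER_PROMPT = 20  # upload limit reported in the procedure above

def batches(items, size=MAX_FILES_PER_PROMPT):
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def plan_sessions(axes, pdf_paths, size=MAX_FILES_PER_PROMPT):
    """Return one (axis, batch) pair per fresh chat session."""
    return [(axis, batch) for axis in axes for batch in batches(pdf_paths, size)]

# Hypothetical labels: 10 rubric axes and 30 exported story PDFs.
axes = [f"axis_{n}" for n in range(1, 11)]
pdfs = [f"story_{n:02d}.pdf" for n in range(1, 31)]
sessions = plan_sessions(axes, pdfs)
print(len(sessions))  # 20 fresh chats: 2 batches for each of the 10 axes
```

Under these assumptions, a full run over 30 stories takes 20 chats, and evaluating every story twice doubles that to 40.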

Irregularities:

In the case of canon-consistency, AI/LLM evaluation tends to systematically underscore this axis, due to a lack of causal reasoning power. LLMs tend to measure how closely a character's actions match their canonical outcomes, but this is the wrong approach. The axis clearly states that what counts is how likely it is that the canonical version of the character would make the same decision given the new circumstances. Even though the axis text states this, an LLM is not very capable of picking it up. Hence, a secondary prompt was used to correct for this effect. Final scores were accepted for this axis only after manual verification of the model's reasoning.

NB: Do not take into account whether plot circumstances look like canon, or are extreme. What matters is, would the characters act the same as their canonical counterparts do, under those new (and possible more extreme) circumstances. Use causal reasoning. We are solely interested in the characters, do not weight the circumstances. Rescore all documents. Note that this is a less strict rule then what you previously applied.

But note that this is no guarantee. One must still carefully read the justifications and scores and then determine whether they sound reasonable. As such, Canon-Consistency is simply less reproducible with LLMs than the other axes.

Theme Quality measures reader impression, which simply fluctuates a bit more in its evaluation than the other axes. The pipeline can still be used: there is simply more fluctuation, not a systematic bias as is the case with Canon-Consistency.

Statistical Analysis

All stories in the dataset were evaluated twice (independently) for each rubric axis using the above procedure. Afterwards, a total score was assigned for each evaluation separately (the procedure is described here). Next, the data was combined with various metadata fields from AO3, such as completion status, wordcount, number of hits/kudos/comments, etc. Generative AI was also used to provide content overviews such as a plot summary, the strongest and weakest aspect of the story, and Viktor Krum's role in the Second Wizarding War. The resulting data table can be viewed here.

The results were further analysed using the following statistical methods:

Cronbach's alpha, which was calculated for each of the two evaluations, to measure the internal consistency of the rubric. This is common in scientific studies; see here and here for examples.
Spearman correlation, which was calculated to investigate the stability of the quality ranking produced by the rubric (see this paper for more background information). Both evaluations were aggregated into their own total score, each providing a ranking of the stories in the dataset. These two rankings were compared to generate the Spearman correlation coefficient. As a tie-break, the score drift was used (low drift is better, see our rubric).
Mean Relative Deviation (MRD), which was calculated to investigate the stability of the absolute scores of the rubric. For the total scores T1 and T2 that a story received in the two evaluations, we computed 2*abs(T1-T2)/(T1+T2). Then, we took the average of this number over all 30 stories, to see how much absolute scores can deviate between evaluation runs.
Correlation study between the popularity of a story and the rubric score of a story.
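The first three measures can be sketched in plain Python as follows. This is an illustrative sketch rather than the code used for the analysis, and it makes two assumptions: Cronbach's alpha is computed in its textbook form, where the total is the plain sum of the axis scores (the rubric's own aggregation rules may differ), and tied ranks in the Spearman correlation receive their average rank instead of being broken by score drift as described above.

```python
from statistics import mean, pvariance

def cronbach_alpha(scores):
    """scores: one row per story, one column per rubric axis."""
    k = len(scores[0])
    item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    total_var = pvariance([sum(row) for row in scores])  # plain sum of axes
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the (average) ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def mean_relative_deviation(t1, t2):
    """Average of 2*abs(T1-T2)/(T1+T2) over all stories."""
    return mean([2 * abs(a - b) / (a + b) for a, b in zip(t1, t2)])
```

For example, a story scored 90 in one run and 110 in the other contributes 2*20/200 = 0.2 to the MRD average.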

Cronbach's alpha measures the internal consistency of the rubric, the Spearman correlation measures its relative (ordinal) stability, and MRD measures its absolute stability. While this is all very valuable information, it does not yet show whether the rubric indeed measures story quality. For that reason, the correlation between the total rubric score and popularity was investigated for all 30 stories in our dataset.

This brings us first to the question of how to measure popularity. AO3 tracks various types of metadata that relate to this:

Hits: The total number of times the story page has been opened on the site.
Kudos: The total number of times readers have pressed the Kudos button to show appreciation.
Comments: The number of written responses readers have left on the story.
Bookmarks: The number of times readers have marked the story for recommendation and/or future reference.

We chose to base Popularity on the number of Hits, because the other metrics are highly sensitive to reader behaviour. Readers may have read and appreciated a story without contributing to Kudos, Comments, or Bookmarks. There are various explanations for this, such as not having the time to click buttons, using a device (such as a smartphone) on which typing comments is less easy, or a story being set not to accept kudos from guests. Hits does not depend on such choices: it simply counts the number of times a story is clicked/loaded. Note that this is still fairly different from whether a story is actually read and/or appreciated, let alone from its quality. But it does seem like the best metadata to use.

Now, AO3 counts hits cumulatively, meaning that it shows the total number of hits generated since the first part of the story was published. A story that was published two years ago will therefore likely have double the hits of a story that was published one year ago. As such, we approximate popularity as Hits/Time.

However, Hits/Time is not yet a very good measure of popularity, as new stories are shown at the top of the page on AO3 by default. As a result, stories typically collect more hits early in their life and fewer as they age, so pure Hits/Time would be unfair to older stories that have actually accumulated a large number of hits. Therefore, we asked ChatGPT to construct correction factors for this problem based on typical Harry Potter fandom behaviour or well-known stories. In our analysis, we have used:

<1 week ago published: Correction = 0
1 week ago published: Correction = 0.15
2 weeks ago published: Correction = 0.25
3 weeks ago published: Correction = 0.33
4 weeks ago published: Correction = 0.4
2 Months ago published: Correction = 0.55
3 Months ago published: Correction = 0.65
5 Months ago published: Correction = 0.74
6 Months ago published: Correction = 0.78
1 year ago published: Correction = 0.88
2 years ago published: Correction = 0.94
3 years ago published: Correction = 0.97
4 years ago published: Correction = 0.98
5 years ago published: Correction = 0.99
6 years ago published (or more): Correction = 1.00

As such, we define a story's Popularity score as Correction*Hits/Time, with Time measured in days between the first publication date of the story and the 9th of April 2026 (our sampling date). This is a measure of how often the story is read, picked, chosen, etc. by people. Our suspicion is that popularity is notably, but not perfectly, correlated with the rubric score, as people choose their stories based on a combination of quality, marketing (how visible the story is on social media) and personal preferences.
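The Popularity score can be sketched as a table lookup followed by the formula above. The text does not say how ages between two table rows are handled or how months and years convert to days, so this sketch assumes a step function (the correction of the last row whose age has been reached), 30-day months and 365-day years. The function names are hypothetical.

```python
from bisect import bisect_right
from datetime import date

SAMPLING_DATE = date(2026, 4, 9)

# (minimum age in days, correction factor), taken from the table above.
# Assumption: 1 month = 30 days, 1 year = 365 days.
CORRECTIONS = [
    (0, 0.0), (7, 0.15), (14, 0.25), (21, 0.33), (28, 0.40),
    (60, 0.55), (90, 0.65), (150, 0.74), (180, 0.78),
    (365, 0.88), (730, 0.94), (1095, 0.97), (1460, 0.98),
    (1825, 0.99), (2190, 1.00),
]
_AGES = [age for age, _ in CORRECTIONS]

def correction_factor(age_days):
    """Step lookup (assumed): correction of the last row whose age is reached."""
    return CORRECTIONS[bisect_right(_AGES, age_days) - 1][1]

def popularity(hits, published, sampled=SAMPLING_DATE):
    """Correction * Hits / Time, with Time in days since first publication."""
    age_days = (sampled - published).days
    if age_days <= 0:
        return 0.0  # not yet published at sampling time, or published that day
    return correction_factor(age_days) * hits / age_days
```

Interpolating between table rows would be an equally defensible reading; the step function is used here only because it needs no further assumptions.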

While the rubric aims to measure quality independently of creative choices such as plot content, characterization, worldbuilding, etc., personal preferences usually depend heavily on such choices. Therefore, we expect to find a significant but far from perfect correlation between Popularity and Rubric score, where the correlated component is the 'independent story quality' we hope to identify.

The Popularity scores, as well as the rubric total scores, can be found in the Results. All basic metadata can be found in the Raw Data. See Discussion for the outcome of these calculations and its interpretation.