SCP Foundation Word Count Analysis
What containment class is word count inflation in SCP entries?

SCP Word Count Analysis

Motivation

A few weeks ago in the weekly scuffles thread in r/hobbydrama, someone asked about what criticisms people saw in fandoms “that seem less like substantive criticism than repeated memes”. That is, things which are often repeated but aren’t actually a true or fair criticism.

Someone else responded with the SCP Foundation, saying that a consistent criticism of “modern” SCPs was that they were lengthy and prone to being world spanning, large in scope, or otherwise apocalyptic. Other points mentioned were overuse of redaction, overuse of cross linking, and a trend towards character driven stories rather than a collection of stand alone weirdness.

This caught my attention, because I absolutely had that impression of the higher number SCPs, and the last time I seriously read much of the site would have been more than a decade ago. If I thought that way back when the highest numbers were in the 3000’s, and people are still saying that to this day, was there any element of truth to this?

Story scope, and most of the other criticisms, are subjective and difficult to analyze quantitatively. Possibly something could be done by checking tags or hyperlinks or similar, but it would likely be fairly annoying and not something I wanted to tackle.

Word count though, that should be fairly straightforward. Enough so that I was surprised I couldn’t find anyone else that had already done it. And if someone has and I missed it, the way too much time I spent on this can be my penance for failing to locate your results.

Yes, to some degree this whole post is me going way overboard to prove someone wrong on reddit as a result of a very mild, and very polite, disagreement. But I also just enjoy doing a bit of data visualization and analysis, so here we are.

Methodology

Rather than crawl the SCP Foundation site directly, I made use of the scp-api dataset at scp-data.tedivm.com. Specifically I cloned the github repo on 12/30/2025. Many thanks to the kind people running this service.

For word counting I accessed the “raw_content” data field, which is the “page-content” HTML block, then used BeautifulSoup to remove the license block and extract the remaining text. I then simply split that string along whitespace, and counted the length of the resultant list. For anyone curious, the code for the basic word counting script is included as an addendum.

I am only looking at the standard entries here, none of the stories, joke entries, or similar.

Is this a perfect method? Of course not, but I spot checked a random sampling and this seems to be close enough for my purposes. I then manually checked all entries with word counts under 300 to try and catch any that needed a manual count.

I had also wanted to calculate a “percent redacted” metric, but the inconsistent use of blackout bars, [DATA EXPUNGED], and other redaction methods meant that the easiest methods would be extremely inconsistent. And frankly I was too lazy to try to come up with a more robust approach.

Date vs Number

I was originally going to do this analysis entirely in terms of either SCP Number or creation date, assuming they would be close enough to make no real difference. The argument for SCP Number is that it reflects how a reader is likely to experience the site, whereas data analysis by date might better capture trends within the SCP writing community.

After doing some comparative plots I decided to present the data both ways. I think they’re sufficiently different to give interesting analysis in both cases.

Plotting creation date against SCP number is pretty fun, in my opinion.

A broadly linear relationship, but clustered into boxes for each 1000 entry long series block.

Results

First Looks

Well, right off the bat let’s just look at word count versus SCP Number:

And word count versus created date:

And, well, it sure looks like word count is increasing over time and with SCP number.

Honestly I had to pause here and go manually check the SCP that came in at over 100,000 words, because I was sure that was the result of some weird error in my counting scheme. But no, SCP-9317 is just fully a lengthy novel masquerading as an SCP Foundation item. I did not read this entry.

So, case proven? Entries really are getting longer the more time goes on?

A raw scatterplot like this can only tell us so much, and this isn’t really telling us much about the underlying distribution of SCPs, so let’s dig in a little further.

But first, one thing I want to call out here is the narrow spikes of longer entries that appear closely clustered. Zooming in, they look to be about a week long, with many of the entries clustered around the beginning and end of that time frame.

I’m guessing this is caused by some sort of writing contest or challenge with a limited entry window, so you get people submitting as soon as it’s opened and also right at the last second. If anyone can shed additional light on this I’d appreciate it.

I ran moving window averages on the data, one in time and one in SCP Number space. For time I included entries created within 15 days in either direction, for a one month window. For SCP Number I included entries within 25 numbers on either side. That’s specifically actual SCP Number, so missing entries don’t cause the window to expand, and the number of entries within the window doesn’t stay constant. These windows were chosen to be a similar fraction of the total SCP Number range and of the time frame of the site. For each SCP, the mean, median, and minimum word counts are plotted for all SCPs within the window.
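As a rough illustration of that windowing (not the exact code used for the plots), here is a minimal sketch using only the standard library. The function name `rolling_stats` and the toy data are made up for the example. Positions can be SCP Numbers with `half_width=25`, or creation dates converted to day numbers (for instance with `date.toordinal()`) with `half_width=15`.

```python
from bisect import bisect_left, bisect_right
from statistics import mean, median


def rolling_stats(positions, word_counts, half_width):
    """For each entry, summarize the word counts of every entry whose
    position lies within +/- half_width of its own (inclusive)."""
    # Sort by position so that each window is a contiguous slice.
    order = sorted(range(len(positions)), key=lambda i: positions[i])
    xs = [positions[i] for i in order]
    ys = [word_counts[i] for i in order]

    results = []
    for x in xs:
        lo = bisect_left(xs, x - half_width)
        hi = bisect_right(xs, x + half_width)
        window = ys[lo:hi]
        results.append({
            "position": x,
            "n": len(window),  # varies, because the window is in position space
            "mean": mean(window),
            "median": median(window),
            "min": min(window),
        })
    return results


# Toy data: SCP Numbers with a gap, so windows hold different numbers of entries.
numbers = [1, 2, 3, 40, 41]
counts = [100, 300, 200, 1000, 3000]
stats = rolling_stats(numbers, counts, half_width=25)
```

Because the window is defined by position rather than by a fixed number of neighbors, the gap between 3 and 40 in the toy data means the first entry averages over three entries while the fourth averages over only two.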

Rolling averages (and smallest entry) for 50 SCP Number window:

and the same for a one month created date window:

A very clear trend towards longer SCPs in both cases.

I’m calculating median because it should be less influenced by extreme outliers than the mean is. Despite that, it’s interesting to see that in both number and time space, the median and means are fairly correlated. There aren’t many places that the mean spikes and the median doesn’t, which indicates to me that the very clear trend towards longer entries isn’t driven by a small number of very long entries. Instead it really is an overall trend of all entries on the site.

The smallest entry line remains mostly constant (outside a few small windows in the 9000’s) which means that there are still short entries being added to the site. However, a randomly selected recent (or high number) entry is likely to be more than three times longer than those that made up the majority of the site when it began.

Percentage Breakdowns

Next I wanted to get a better idea of the internal distribution of story size than simple averages could tell me. These plots are potentially a little confusing to parse, so I’ll try to explain what I did.

I started by ranking all entries on the site into a list from shortest to longest, and divided that list into four quartiles. Then, for each Series (or year, for dates), I plotted the percentage of the stories within that Series that fell into each quartile.
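A minimal sketch of that breakdown, assuming each entry is reduced to a `(group, word_count)` pair where the group is a Series or a year. The function name `quartile_shares` and the toy numbers are hypothetical, and ties at a quartile boundary are split by rank, which may differ slightly from how the plots were built.

```python
from collections import defaultdict


def quartile_shares(entries):
    """entries: list of (group, word_count) pairs.
    Returns {group: [pct in Q1, pct in Q2, pct in Q3, pct in Q4]},
    where the quartiles are taken over the whole site, not per group."""
    ranked = sorted(entries, key=lambda e: e[1])  # shortest to longest
    n = len(ranked)
    tallies = defaultdict(lambda: [0, 0, 0, 0])
    for rank, (group, _) in enumerate(ranked):
        quartile = rank * 4 // n  # 0 = shortest quarter, 3 = longest quarter
        tallies[group][quartile] += 1
    return {
        group: [100 * c / sum(cells) for c in cells]
        for group, cells in tallies.items()
    }


# Toy data: an early series of mostly short entries and a late, longer one.
entries = [
    ("Series 1", 100), ("Series 1", 200), ("Series 1", 300), ("Series 1", 900),
    ("Series 9", 400), ("Series 9", 2000), ("Series 9", 3000), ("Series 9", 5000),
]
shares = quartile_shares(entries)
```

Each group's four percentages sum to 100, so they can be drawn directly as one stacked bar per Series or year.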

Series quartiles:

Year quartiles:

First, I think this does a very good job of illustrating the shift in entry length across the history of the SCP Foundation wiki.

Second, I think it’s interesting that past Series 3 (and past about 2014, although it’s messier in the time series) the fraction of the smallest quartile stayed very similar. What the longest stories on the site are eating isn’t the very short stories (although those are still in decline); instead they’re mostly supplanting the medium length ones.

Third, there’s an interesting reversal of the trend that happened in 2025. I have no idea if that’s just random clustering, or indicative of some shift within the SCP community. Anyone more involved in the site is invited to enlighten me on that.

Fourth, although I like these plots a lot, what they don’t capture is the increasing number of extremely long entries. Novel length, you might say…

Book Categorization

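The Science Fiction and Fantasy Writers Association defines book categories as follows:

Short Story: under 7,500 words
Novelette: 7,500 to 17,499 words
Novella: 17,500 to 39,999 words
Novel: 40,000 words or more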

This is specifically in service of the Nebula Awards, but I don’t have any better definitions at hand, so let’s repeat the prior analysis but with book categories rather than quartiles:

Book categories by year:

Book categories by series:

And uh, yeah. Not very enlightening honestly, it’s about the same as the prior plots but with significantly less useful categories. The vast majority of SCPs fall into the Short Story category, so all this really does is show that the fraction of very long entries is going up, but we already knew that.
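For anyone who wants to reproduce the bucketing, a sketch might look like the following. The thresholds in `CATEGORY_FLOORS` are the Nebula Awards word count limits as I understand them, so treat them as an assumption and check the current rules. The function name `book_category` is made up for the example, and the sample word counts are taken from the manual counts listed further down.

```python
from collections import Counter

# Assumed Nebula Awards thresholds: the minimum word count for each category,
# listed from longest to shortest so the first match wins.
CATEGORY_FLOORS = [
    (40_000, "Novel"),
    (17_500, "Novella"),
    (7_500, "Novelette"),
    (0, "Short Story"),
]


def book_category(word_count):
    """Return the category name for a non-negative word count."""
    for floor, name in CATEGORY_FLOORS:
        if word_count >= floor:
            return name
    raise ValueError("word count must be non-negative")


sample = {"SCP-7359": 34374, "SCP-8500": 10000, "SCP-8145": 135, "SCP-3125": 1984}
tally = Counter(book_category(wc) for wc in sample.values())
```

Even in this hand-picked sample of unusual entries, half of them land in Short Story, which matches how lopsided the full plots are.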

Ratings

The last thing I did before realizing I’d spent way too long on this project was try to see if there was any predictive power between SCP length and how well liked that SCP is. Both ratings and word count span many orders of magnitude, so I felt a log scale would be most useful, and this is what we end up with:

I masked out negative ratings (which do not play nice with logarithms), offset word count by 1 (so the one zero word entry doesn’t cause issues), took the log of both, sorted all the entries into evenly spaced bins in log space, then normalized the results from 0 to 1. The heat map was treated similarly, since although I could have simply used direct counting, taking the log lets you see a bit more nuance in the plot. The reason you see some horizontal, separated lines towards the bottom is just the underlying discretization of the rating data (ratings must always be integers after all).
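The steps above can be sketched roughly as follows, using only the standard library. This is an approximation under a few assumptions rather than the plotting code itself: ratings of zero are dropped along with the negatives (their log is undefined too), the logs are base 10, and the bin count is arbitrary. The name `log_heatmap` is hypothetical.

```python
import math


def log_heatmap(word_counts, ratings, bins=4):
    """Bin (word count, rating) pairs on a log-log grid and return a
    bins x bins grid of cell intensities scaled to the range 0..1."""
    # Drop non-positive ratings; offset word counts by 1 so zero words is fine.
    points = [
        (math.log10(wc + 1), math.log10(r))
        for wc, r in zip(word_counts, ratings)
        if r > 0
    ]
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)

    def bin_index(value, low, high):
        if high == low:
            return 0
        # Evenly spaced bins in log space; the maximum falls in the last bin.
        return min(int((value - low) / (high - low) * bins), bins - 1)

    grid = [[0] * bins for _ in range(bins)]
    for x, y in points:
        grid[bin_index(y, y_lo, y_hi)][bin_index(x, x_lo, x_hi)] += 1

    # Log-compress the counts, then scale so the busiest cell is 1.0.
    compressed = [[math.log1p(c) for c in row] for row in grid]
    peak = max(max(row) for row in compressed)
    return [[c / peak for c in row] for row in compressed]


# Toy data: the last entry has a negative rating and is masked out.
heat = log_heatmap(
    word_counts=[0, 0, 99, 999, 9999, 9999],
    ratings=[1, 1, 10, 100, 1000, -5],
)
```

Rows of the returned grid are rating bins and columns are word count bins, so it can be handed to any image style plotting call.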

This is a ton of explanation for the final result of: there is no particular correlation between word count and rating. To be fully rigorous I really should have normalized the ratings to account for the fact that rating roughly trends with the age of an entry, but honestly this has all already taken much longer than I had intended it to, so maybe that’s for a followup post.

Conclusions

So, yes, as was clear from the start, SCPs are, on average, significantly longer now than they used to be. There’s a pretty clear trend towards higher word count, and an entry posted today is likely to be more than three times the length of one posted back at the start of the site.

There are still short stories; although of slowly diminishing fraction, there has been a fairly consistent contingent of very short SCPs written and added to the site over its history. There even seems to be some recent movement in the past year that bucks the overall trends.

Is this good? Bad? Entirely subjective, and I’ll keep my personal thoughts to a followup section to not pollute things here.

However, without a doubt, the “common criticism” that SCPs are getting longer is one that is entirely based in fact.

The other aspects: high degree of interconnections with other SCPs, a heavier emphasis on “lore”, a trend towards large scale apocalyptic stories? These aren’t captured by my analysis, although anecdotally all of those things seem to be well correlated with word count, so I’d be shocked if those weren’t also on the rise.

There’s a lot more analysis that could be done on this data set, and I’d like to dig further in, but doing the manual checking of word counts took so much longer than I was expecting it to that I just don’t have the motivation to do so at this time. If anyone else would like to jump in and do some additional data science, I heartily encourage it. My own code for grabbing the data is found at the bottom of this post, but it might be worth someone’s time to try and find a more rigorous way to capture the sort of SCPs that required me to do hand counts.

Personal Thoughts

I don’t particularly think my opinion on what makes a good SCP holds much weight. The last time I really engaged with the site prior to this project was around ten years ago, and I never went deeper than just reading the site. If the people currently writing, reading, and participating in the SCP community like and enjoy the longer, more involved, more story oriented entries then they should continue to do what makes them happy.

That being said, I simply enjoy the smaller, less bombastic entries much more. The thing that made the SCP wiki interesting to me in the first place is the intrinsic constraints that the format places on the writer. It’s an obvious truism, but restrictions really do breed creativity. I find this especially true for amateur writers (among which I count myself), as it forces the author to be much more careful and deliberate with what to include. The most popular, and probably best, TTRPG thing I ever made was a result of me embracing heavy restrictions (fit it on a business card) rather than trying to subvert or avoid them. (This is entirely free, this is not intended to be self promo.)

The inherent restrictions of the SCP Foundation format prevented the entries from devolving into simple creepypasta, and most creepypasta is terrible. Consistently I find the first 20% of a creepypasta style story to be interesting, compelling, and spooky. And just as consistently I find the final 20% absolutely throws it all away and ruins the whole thing. Often it’s by over explaining, over playing a hand, blowing up the scale too big, losing the grounding that made the story interesting in the first place.

This is absolutely personal taste, and speaks to the sort of stories I find compelling. For an example: I thought qntm’s There Is No Antimimetics Division was fantastic while it was still a series of loosely connected stories, and I lost more and more interest as it became clear that everything was connected, and it was all leading up to a huge apocalyptic climax. I felt very similarly about their other novels, like Ra or Fine Structure. By comparison, I think the short form fiction they have on their site is consistently great. I had a similar arc with The Magnus Archives: loved it until it started turning into a big, interconnected ‘cinematic universe’.

What I loved about the SCP Foundation was that the format encouraged and (to some degree) enforced writing entries in a way I found very appealing. We don’t get complete entries or data, we are getting a very narrow, very skewed view of something weird. The purpose of the wiki is a quick reference for containment procedures, not a database of all available information, and even that is often heavily redacted. It’s a pseudo-government entity, so the language is dry, clinical, and flat. We’re peering through a narrow window at some strange and spooky thing, and we never learn enough to yank the rug of suspension of disbelief out from under our feet.

I wish I could say that this project rekindled my love for the wiki as a whole, but if anything the opposite is true. The ones I was reading were mostly the ones I had to manually check, and the ones I had to manually check were usually doing something funky with the format, which turned out to be a great way to find apocalyptic SCPs or character driven stories that are absolutely not what I come to the SCP Foundation to read. I did run across some that tickled me, I genuinely really like SCP-2505 for instance, but overall I am not hugely motivated to really dive back in. Especially when, as I found while checking the discuss pages for source code or text, it’s clear that the SCP Community broadly does not share my particular taste.

If you’d like to dunk on my own writing to discredit my criticisms, I wrote an SCP Foundation adjacent TTRPG supplement that I will add a bunch of free community copies to, so feel free to grab one at your leisure. Not A Place Of Honor is a combination of the internet famous nuclear waste containment documents, a fantasy oriented SCP Foundation, and a “found footage” style art book. It’s also, ostensibly, something you could actually use at your table in a game so the goals and motivations don’t entirely align with what the SCP wiki is doing. Still, criticize away! (For free! Still not intended to be self promo, just putting words to action.)

The data analysis was fun, the manual word count checking was not, and I hope you enjoyed looking at the fun plots if nothing else. Thanks for reading!

Manual Counts

I manually checked all articles that registered fewer than 300 words. This seems to have done a pretty good job of finding articles that bury their text past a link, embed, or something otherwise unusual.

I’m not going to manually check all (nearly) 10,000 pages, but if you know of any other articles likely to have “hidden” text, please feel free to let me know. Automatically checking for “offset” pages would find some, but not all, of the hidden text, but that was outside the scope of what I wanted to do for this project.

There are some number of audio based articles (SCP-1159 for example) which I have decided not to try and transcribe myself out of pure laziness, so the word counts for these are going purely off of the actual text.

For what it’s worth I don’t think the potential missing word count is going to materially change the overall picture, and if anything I am undercounting the later SCPs.

I started writing this section with a threshold of 5 words, then bumped it up every time I cleared them out to check for more. Once I got up to the 100-200 words count range, there were too many for me to make special notes for all of them, so I’m just including the ones I had already written. This took me much longer than I had anticipated.

For all manual counts I maintained the rule of including pages that seemed to be exclusive to that particular entry, but not including pages that seemed “supplementary”. That is to say I’m not including stories, site dossiers, or similar just because they’re linked to in an entry.

SCP-2062

I did not include all of the hovertext, but the main page text comes to 459 words.

SCP-2212

All text on the main page is redacted, so I have manually added the word count of the linked “archival” pages at 2741 words. I’m guessing there’s some additional text if the logic puzzles are solved, but I decided I am too lazy to work it out so this article’s word count will be underestimated.

SCP-2521

All images. If a picture is worth a thousand words should I count this as… 15 thousand words? I’ll keep this as is, which I think is within the spirit of the analysis.

SCP-3125

The article on the page (past the code lock) adds up to 1984 words. I’m not including any of the further, linked material (funnily enough including the entirety of There Is No Antimimetics Division would still only be the 4th longest entry).

SCP-3340

The disappearing text has 384 words prior to deletion.

SCP-3493

A multiversal stroll for a total of 4423 words, not including UNI-7411 which has minimal recognizable text.

SCP-3677

The comic anomaly! An OCR scrape of the images puts this at 1913 words.

SCP-4205

A semi interactive terminal, should probably have found a way to pull this out of the page source but instead I manually copy/pasted it page by page for a total of 5031 words.

SCP-4500

A dialogue of 1434 words.

SCP-4673

Agent Edward Carter Hardwick affirms this article has 824 words.

SCP-4707

Grabbing the text manually puts this at about 500 words, give or take, which I’m going to use as the word count.

SCP-4939

Sum of the main page and all three offsets is 2681 words.

SCP-5646

I decided not to include the linked 001 proposal, so the main page comes in at 910 words.

SCP-5999

Word count of 8731 retrieved from the static “reduced” version of the page.

SCP-6017

The prompts plus article come in at about 889 words.

SCP-6298

1063 words from the fully ‘expanded’ page.

SCP-6548

8394 words past a fake login page.

SCP-6634

A point and click adventure video game SCP, word count pulled from json file offered for translation purposes. I did not go through the effort of extracting purely actual game text, so the 5000 word count used here is just an estimate.

SCP-6779

Main page plus both linked pages come to 1634 words.

SCP-6988

Clicking through and expanding all sections (but not counting the linked site dossier) gets to 1849 words.

SCP-7021

All pages summed, minus some of the fiddly interactive incident report things in the middle, come to 6190 words.

SCP-7057

Summing the different page versions gets to 3248 words.

SCP-7359

Beautiful soup to the rescue once again! Even after removing the repetitive “east” text this comes in at a whopping 34374 words.

SCP-7535

1528 words not including the embedded footnotes.

SCP-7974

Although there are seven alternate pages here, they’re close enough that I’m counting it as the word count of the final one plus a hundred words, for a total of 2786.

SCP-8054

1917 words of faux terminal system.

SCP-8145

This one is genuinely small, but I’ve added the text of all poems to the word count for a grand total of 135 words.

SCP-8992

An article with two versions, manually running a word count on them gives a combined total of 2729 words.

SCP-8500

Another video game SCP. As with SCP-6634, the word count estimate of 10000 words is based off of the provided json file containing all game text.

SCP-3923

3923 words of faux terminal and connected pages.

Word Counter Code

Cloning the git repo and running this python code in the root directory should create “ModifedIndex.json” which is a copy of the included “index.json” with the addition of a “word_count” field in each entry, and only including the normal “article” entries.

Commenting out the code under “# Add Manual Entries” will remove my efforts to correct the entries noted above.

Note: you could remove the tqdm import and the tqdm wrapped around the inner loop if you don’t want to install that or don’t want to see progress bars.

import json
from tqdm import tqdm
import os
from bs4 import BeautifulSoup

# Define path to index files
content_path = "/scp-api/docs/data/scp/items/"
cur_dir = os.path.dirname(os.path.realpath(__file__))

# Create empty dictionary to store data
new_dict = {}

# Open index file
with open(cur_dir+content_path+"index.json") as index_file:
	index_data = json.load(index_file)

	# Step through each series file
	for series in ["1","2","3","4","5","6.0","6.5","7.0","7.5","8.0","8.5","9.0","9.5","10.0","10.5"]:

		# Open series file
		with open(cur_dir+content_path+"content_series-"+series+".json") as series_file:
			
			series_data = json.load(series_file)

			# Step through each article
			for key,value in tqdm(series_data.items(),desc=series,disable=False):

				# Extract html
				soup = BeautifulSoup(value["raw_content"],features="lxml")

				# Remove license box
				for div in soup.find_all("div",{"class":"licensebox"}):
					div.decompose()

				# Count words (split text at white space, count resultant list)
				cur_size = (len(soup.get_text().split()))

				# Add word count entry to index file
				index_data[value["scp"]]["word_count"] = cur_size

				# Copy meta-data entry to new dictionary
				new_dict[value["scp"]] = index_data[value["scp"]]

	# Add Manual Entries
	## Comment out code starting here to remove manual additions
	manual_scps = {
					"SCP-2062" : 459,
					"SCP-2212" : 2741,
					"SCP-3493" : 4423,
					"SCP-3125" : 1984,
					"SCP-3340" : 384,
					"SCP-3677" : 1913,
					"SCP-4205" : 5031,
					"SCP-4500" : 1434,
					"SCP-4673" : 824,
					"SCP-4707" : 500,
					"SCP-4939" : 2681,
					"SCP-5646" : 910,
					"SCP-5999" : 8731,
					"SCP-6017" : 889,
					"SCP-6298" : 1063,
					"SCP-6548" : 8394,
					"SCP-6634" : 5000,
					"SCP-6779" : 1634,
					"SCP-6988" : 1849,
					"SCP-7021" : 6190,
					"SCP-7057" : 3248,
					"SCP-7359" : 34374,
					"SCP-7535" : 1528,
					"SCP-7974" : 2786,
					"SCP-8054" : 1917,
					"SCP-8145" : 135,
					"SCP-8500" : 10000,
					"SCP-8992" : 2729,
					"SCP-3923" : 3923,
					"SCP-245" : 2812,
					"SCP-1663" : 3837,
					"SCP-2111" : 2735,
					"SCP-2505" : 355,
					"SCP-2521" : 0,
					"SCP-2744" : 6001,
					"SCP-3211" : 1821,
					"SCP-4069" : 3787,
					"SCP-4706" : 2180,
					"SCP-4930" : 404,
					"SCP-5011" : 1939,
					"SCP-5153" : 675,
					"SCP-5356" : 1072,
					"SCP-5647" : 625,
					"SCP-5921" : 3357,
					"SCP-6038" : 4861,
					"SCP-6465" : 1532,
					"SCP-6688" : 1803,
					"SCP-7100" : 12439,
					"SCP-7259" : 505,
					"SCP-7345" : 2629,
					"SCP-7472" : 1020,
					"SCP-7585" : 21107,
					"SCP-8559" : 1104,
					"SCP-9003" : 2853,
					"SCP-9008" : 33361,
					"SCP-9034" : 6966,
					"SCP-9102" : 10820,
					"SCP-9494" : 4013,
					"SCP-9822" : 7066,
					"SCP-1796" : 1066,
					"SCP-3272" : 697,
					"SCP-4742" : 10384,
					"SCP-5235" : 1520,
					"SCP-5623" : 1855,
					"SCP-5971" : 3089,
					"SCP-6000" : 7604,
					"SCP-6006" : 5609,
					"SCP-6689" : 4201,
					"SCP-7376" : 8301,
					"SCP-8009" : 6917,
					}

	for scp,words in manual_scps.items():
		new_dict[scp]["word_count"] = words
	## End commenting out for manual additions

	# Save dictionary with word counts as json file
	with open("ModifedIndex.json",'w') as new_file:
		json.dump(new_dict,new_file)