thewhitelily: (Default)
So, I've got a functional version 1.0 of Squidesaurus, the relational thesaurus for writers.

The current idea is that, rather than trawling the Internet at large, it uses the full text of books from Project Gutenberg to build up its database of word relationships. Hopefully, eventually, it'll trawl Gutenberg without me having to download and feed in each book separately.

Not sure, as yet, whether it will be actually useful, but it's certainly been an interesting project. I've taken the opportunity to come to grips with WPF and C#, both of which are simultaneously cool and frustrating. The algorithms involved in cleaning up the text, removing stop words, stemming the words (so that "running", "runner", and "runs" get put into the same box with "run"), and then actually traversing the text searching for word pairs, are quite involved and intriguing.
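For the curious, the clean-up steps could look roughly like this. It's a minimal Python sketch, not the actual C# code behind Squidesaurus: the stop-word list, the suffix rules, and the names `crude_stem` and `preprocess` are all made up for illustration, and a real stemmer (Porter's, say) is far more careful than this one.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "it", "that", "he", "his"}

# Hypothetical suffix rules, tried in this order; this is NOT the Porter algorithm.
SUFFIXES = ("ning", "ner", "ing", "er", "s")

def crude_stem(word):
    """Strip the first matching suffix, as long as at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, split into alphabetic tokens, drop stop words, stem the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

stems = preprocess("The runner was running and he runs.")
```

Rules this crude will happily mangle other words ("morning" loses its "ning" too), which is a fair illustration of why the stemming step is more involved than it first sounds.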

My test case has been Moby Dick, which I thought would have some nice high-frequency combinations in there. The results are substantially what I expected, which is always a good thing to hear when developing software.

Looking within five words of:
"whale" gives back "sperm", "white", "right", and "great"
"ship" gives back "sail", "whale", "sea", and "boat"
"leg" gives back "ivory", "one", "ahab", and "lost"
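The "within five words" lookup can be sketched as a sliding window over the cleaned-up token list. Again, this is a hedged Python illustration rather than the real implementation: the function name `neighbours`, the choice to skip other occurrences of the target word itself, and the toy sentence are all assumptions.

```python
from collections import Counter

def neighbours(tokens, target, window=5):
    """Count every other word found within `window` positions of each `target`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            # Skip the target itself, including any nearby repeats of it.
            if j != i and tokens[j] != target:
                counts[tokens[j]] += 1
    return counts

# Toy input, invented for illustration (not taken from any of the books).
tokens = "great white whale sperm whale ship".split()
counts = neighbours(tokens, "whale", window=2)
top = counts.most_common(4)
```

Asking for `most_common(4)` is what would give the four words per target in the lists above.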

Now I've also fed in Twenty Thousand Leagues Under the Sea, Two Years Before the Mast, and Treasure Island. Yes, there's a theme.

Now, within five words of:
"ship" gives back "board", "sail", "crew", and "like"
"sea" gives back "red", "upon", "bottom", and "open"
"water" gives back "surface", "fresh", "salt", and "clear"
"squid" gives back "apparition", "arm", "thing", and "whale"

No results back yet on the 'usefulness' issue, but I suspect it will get better as I feed it more books and filter the results to compensate for words that are just plain common.
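One possible way to do that filtering is to divide each neighbour's count by how often the word turns up in the corpus overall, so that words like "like" and "upon" stop floating to the top. This is only a sketch of one option, not a description of what Squidesaurus does: the function `rerank`, the `min_count` cutoff, and every number below are invented for illustration.

```python
from collections import Counter

def rerank(cooccurrence, corpus_counts, min_count=2):
    """Divide each neighbour's co-occurrence count by its overall corpus count."""
    scores = {}
    for word, near in cooccurrence.items():
        if near < min_count:
            continue  # too few sightings to trust the ratio
        scores[word] = near / corpus_counts[word]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up counts, purely to show the effect; these are not real results.
cooc = Counter({"like": 40, "board": 30, "crew": 12})
totals = Counter({"like": 2000, "board": 100, "crew": 30})
ranked = rerank(cooc, totals)
```

With these toy numbers the common word drops to the bottom even though it has the highest raw count. The trade-off is that very rare words can score too well, which is what the `min_count` cutoff is there to dampen.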

Even if I do say so myself, I'm rather impressed.
thewhitelily: (Default)
I’ve realised one of the reasons why I’m usually disappointed when I consult a thesaurus looking for whatever that word is that I’m really looking for—what I’m usually looking for is much closer to a word association dictionary than anything else.

For example, at the moment, I’m working on a paragraph that begins: It was like kissing a squid.

I love the image. For starters, squid is an excellent word in this context, because it’s short and punchy and the squ sound is uncommon enough to bring a little surprise to intensify the humour and onomatopoeically bring to mind a host of appropriate unpleasant words like squish, squelch, or squeeze. It’s creepy and unnatural and undignified; it’s a confusion of flailing limbs; it’s cold, wet, and impossible to escape the suction to surface for air. And don’t even get me started on camouflage, grotesque intelligence, coiled arms, inky eyes, beaky noses, or cold fish.

The word use is coming really easily; I’m having to rein it in with both hands and my teeth to halt the descent into madness. That's no problem. I'll pick a few (perhaps three?) of the best, and then scatter a couple of the more subtle ones into the remainder of the scene; the reader’s mind will fill in the rest.

But finding the words doesn’t always come this easy, and it’s obvious, now I think of it, why I hardly ever find it helpful to look in a dictionary, thesaurus, or even an encyclopaedia for help with imagery. None of them are close to the actual relationship between the words that I’m looking for, which I guess is why it’s so damn hard, and why it’s so damn awesome whenever you find a writer who pens excellent images.

Surely there’s a niche market out there, though, for an imageaurus? Something like Visual Thesaurus, where you can follow the links to relations of relations, to go from squid to ink to black, or squid+ink to cloud, etc.
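Following links to relations of relations is essentially a breadth-first walk over a graph of associated words. Here is a minimal Python sketch of that idea; the `LINKS` table is invented for illustration, and `related` and `max_hops` are hypothetical names rather than anything from an existing tool.

```python
from collections import deque

# Hypothetical association links, invented purely for illustration.
LINKS = {
    "squid": ["ink", "arm"],
    "ink": ["black", "cloud"],
    "arm": ["coil"],
}

def related(start, max_hops=2):
    """Breadth-first walk returning {word: hops from start}, up to max_hops."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        word = queue.popleft()
        if hops[word] == max_hops:
            continue  # don't expand beyond the hop limit
        for nxt in LINKS.get(word, []):
            if nxt not in hops:
                hops[nxt] = hops[word] + 1
                queue.append(nxt)
    del hops[start]  # the starting word isn't its own relation
    return hops

two_hops = related("squid")
```

In this toy table, "ink" is one hop from "squid" and "black" is two, which is the squid to ink to black chain.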

What I really need to do is build a web crawler that examines text for words commonly found near “squid” (and every other word, of course) and ranks the strength of their relationship based on number of hits and proximity. Of course, it wouldn’t go all the way to producing original and compelling images, any more than a dictionary or a thesaurus does. It would probably even contribute to the problem of having every kiss described as ‘passionate’ and every villain described as ‘evil’, but I think it’d be an awesome tool in the hands of a good writer...
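Ranking by both hits and proximity could be as simple as letting every sighting add a score that shrinks with distance. The sketch below, in Python, assumes a 1/distance weighting, which is just one plausible choice; `proximity_scores` and the toy sentence are made up for illustration.

```python
from collections import defaultdict

def proximity_scores(tokens, target, window=5):
    """Sum 1/distance for every word seen within `window` positions of `target`."""
    scores = defaultdict(float)
    for i, token in enumerate(tokens):
        if token != target:
            continue
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i and tokens[j] != target:
                # Adjacent words add 1.0, words two away add 0.5, and so on.
                scores[tokens[j]] += 1.0 / abs(j - i)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy input, invented for illustration.
tokens = "squid ink black cloud squid ink".split()
ranked = proximity_scores(tokens, "squid", window=3)
```

A word that appears often and close by ends up with the highest total, so frequency and nearness both count without needing two separate rankings.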

On second thoughts, what I really need to do is to find that someone else has built this exactly as I want, with minimal effort to me.

Who said all the good ideas were taken? :P


The White Lily

Page generated Oct. 19th, 2017 02:41 pm