Squidesaurus
Jan. 29th, 2009 11:29 pm
So, I've got a functional version 1.0 of Squidesaurus, the relational thesaurus for writers.
The current idea is that, rather than trawling the Internet at large, it uses the full text of books from Project Gutenberg to build up its database of word relationships. Hopefully, eventually, it'll trawl Gutenberg without me having to download and feed in each book separately.
Not sure, as yet, whether it will actually be useful, but it's certainly been an interesting project. I've taken the opportunity to come to grips with WPF and C#, both of which are simultaneously cool and frustrating. The algorithms involved in cleaning up the text, removing stop words, stemming the words (so that "running", "runner", and "runs" get put into the same box with "run"), and then actually traversing the text searching for word pairs, are quite involved and intriguing.
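For the curious, here is a minimal sketch of the clean-up, stop-word, and stemming steps. It is in Python rather than the C# the real thing is written in, and it is not Squidesaurus's actual code: the stop-word list is a tiny made-up sample, and `crude_stem` is a hypothetical suffix stripper standing in for a proper stemming algorithm.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real list is much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that", "with"}

# Suffixes tried in order, longest first. This is a crude stand-in for a real
# stemmer and will mangle plenty of words (e.g. "corner" becomes "cor").
SUFFIXES = ("ning", "ner", "ing", "er", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lower-case, keep alphabetic runs only, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]
```

With these rules, "running", "runner", and "runs" all land in the "run" box, which is the behaviour described above; a real stemmer gets there with far fewer casualties.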
My test case has been Moby Dick, which I thought would have some nice high-frequency combinations in there. The results are substantially what I expected, which is always a good thing to hear when developing software.
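The word-pair search itself can be sketched as a sliding window over the cleaned-up word list. Again this is Python and only an assumption about how such a search might work, not the real implementation; `count_neighbours` and `top_neighbours` are names invented for the example, and the default window of five matches the "within five words" lookups.

```python
from collections import Counter, defaultdict


def count_neighbours(tokens, window=5):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    neighbours = defaultdict(Counter)
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the word's own position
                neighbours[word][tokens[j]] += 1
    return neighbours


def top_neighbours(neighbours, word, n=4):
    """Return the n most frequent neighbours of `word` (empty list if unseen)."""
    return [w for w, _ in neighbours.get(word, Counter()).most_common(n)]
```

Counting both directions means every pair is recorded twice, once under each word, which costs memory but makes lookups for either word trivial.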
Looking within five words of:
"whale" gives back "sperm", "white", "right", and "great"
"ship" gives back "sail", "whale", "sea", and "boat"
"leg" gives back "ivory", "one", "ahab", and "lost"
Now I've also fed in Twenty Thousand Leagues Under the Sea, Two Years Before the Mast, and Treasure Island. Yes, there's a theme.
Now, within five words of:
"ship" gives back "board", "sail", "crew", and "like"
"sea" gives back "red", "upon", "bottom", and "open"
"water" gives back "surface", "fresh", "salt", and "clear"
"squid" gives back "apparition", "arm", "thing", and "whale"
No results back yet on the 'usefulness' issue, but as I feed it more text and filter the results to compensate for words that are just plain common, I suspect it should get better.
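One possible way to compensate for plain common words, sketched in Python, is to divide each co-occurrence count by how often the neighbour appears in the whole corpus, so that words like "like" and "upon" stop winning on sheer volume. This is an assumption about what such a filter could look like, not something Squidesaurus does yet; `rerank`, `word_totals`, and `min_count` are hypothetical names.

```python
def rerank(cooccurrence, word_totals, min_count=2, n=4):
    """Rank neighbours by co-occurrence count relative to overall frequency.

    cooccurrence: mapping of neighbour -> times seen near the target word
    word_totals:  mapping of word -> times seen anywhere in the corpus
    min_count:    ignore neighbours seen fewer times than this (noise guard)
    """
    scored = []
    for neighbour, count in cooccurrence.items():
        if count < min_count:
            continue
        # Share of the neighbour's appearances that fall near the target word.
        scored.append((count / word_totals[neighbour], neighbour))
    scored.sort(reverse=True)
    return [neighbour for _, neighbour in scored[:n]]
```

The `min_count` guard matters because a word that appears once in the corpus, right next to the target, would otherwise get a perfect score.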
Even if I do say so myself, I'm rather impressed.