
filtering out gibberish

Apr 20, 09:53 PM

So, this is a bit of a long story. As background, one of my projects is building a DSpace repository for 3D architectural models. These models are stored in a format that is unique to the tool the researchers use, which means they are big binary files. There is interesting text data in these files, so a full text index would be helpful for people searching through them all, to find something they might have missed. I mean, we’ll have good metadata for these files, but not all data is surfaced by even the best metadata. The typical approach for DSpace repositories is to convert the data formats to something DSpace can index, like a PDF or a CSV file. But these data files are not the kind of thing you just convert. And no text extraction tool exists for this data format. Until today. :-)

I’ve been thinking about this problem for a few months. I’ve even run the files through the very handy strings program. Unfortunately, strings is very exuberant and pulls all kinds of gibberish out of a binary file. So… not very useful for a full text index. I shelved the idea. Until yesterday.

Yesterday, during a meeting, I had occasion to talk about a tool I’ve wanted to play with for a long time, the Data Science Toolkit. I pulled up the URL so I could talk meaningfully about it, but didn’t get to make more than a passing mention. After the meeting, before closing the tab I had open, I noticed the Text to Sentences API for the DSTK. I tinkered with it. It had never occurred to me that you could filter out gibberish and retrieve meaningful text. This was the missing piece! I poked around a bit, looking for code, but ran out of time to do any more investigation, so I instead shouted out to the void (i.e. I tweeted):

and, amazingly, I got a response:

So, I grabbed a copy of cruftstripper.rb, uncommented the last three lines, renamed the script “sentences”, pointed strings at a data file, and piped that gibberish through my new “sentences” script. Voilà:

strings big-data-file.3d | sentences

And I got useful data, which I can’t share with you because I don’t have permission to do so. But you’ll have to trust me: it’s beautiful.

And the really cool thing about this approach is, it works for any mysterious data format a repository might receive.
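To give a feel for what "filtering out gibberish" can mean, here is a minimal sketch in Python. To be clear, this is not the algorithm cruftstripper.rb or the DSTK uses; it is a made-up heuristic (keep a line only if most of its tokens look like plausible words), and every name and threshold in it is an assumption chosen for illustration.

```python
VOWELS = set("aeiouyAEIOUY")
PUNCTUATION = ".,;:!?()\"'"


def looks_like_word(token, min_len=2, max_len=20):
    """Hypothetical test: alphabetic, sane length, has a vowel, sane casing."""
    if not token.isalpha():
        return False
    if not (min_len <= len(token) <= max_len):
        return False
    if not any(c in VOWELS for c in token):
        return False
    # Rejects mixed-case noise such as "xQzR".
    return token.islower() or token.isupper() or token.istitle()


def looks_like_text(line, min_words=3, min_ratio=0.7):
    """Keep a line when it has enough tokens and most of them look like words."""
    tokens = line.split()
    if len(tokens) < min_words:
        return False
    good = sum(1 for t in tokens if looks_like_word(t.strip(PUNCTUATION)))
    return good / len(tokens) >= min_ratio


def filter_lines(lines):
    """Yield only the lines that pass the heuristic, stripped of whitespace."""
    for line in lines:
        line = line.strip()
        if looks_like_text(line):
            yield line


# Invented sample input, standing in for the output of strings.
sample = [
    "The north wall is load bearing.",
    "x$Q@ 9fK2 zz~",
    "@(#)PROGRAM:xx",
    "Ground floor plan revised",
]
for kept in filter_lines(sample):
    print(kept)
```

Swapping the sample list for `sys.stdin` would let a script like this sit at the end of the same kind of pipe as the “sentences” script. The thresholds would almost certainly need tuning against real files, and a word-list lookup would likely do better than the vowel check.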

Now, what I have left to do is to wire this mess up into a media filter script for DSpace, which will be pretty easy. The tricky bit will be figuring out a way to do it so that the code can be merged into the DSpace application by default. Because that’s what I really want to happen here: give DSpace the ability to produce a full text index from any file.

I have very recently gone through the hassle of changing ISPs at home, because DSL just wasn’t reliable enough (or fast enough) for my working at home needs. However, in all the troubleshooting I had to do to try to get DSL to behave more reliably for me, I made an interesting discovery: setting the MTU on your network interface on your computer can have a profound impact on its reliability. Before I get into the details, trust me on this, figure out how to set your MTU and set it to lower than the default (which is usually 1500).

I am not a network professional, so please forgive my very basic grasp of the facts, but, from my understanding, the MTU (maximum transmission unit) is the largest size of the information packets your computer sends through its network interface. The default MTU is 1500, which is an ideal, “everything is working great” number. Now, the rest of the network upstream from you can split those ideal packets into smaller sizes, to get them to where they need to go. Splitting them up creates a burden: something has to keep track of all the pieces and put them back together before the data is usable, and if any one piece goes missing, the whole packet has to be sent again.
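To put rough numbers on the splitting, here is a small Python sketch. It assumes plain IPv4 with a 20-byte header and no options, and relies on the IPv4 rule that every fragment except the last carries a payload that is a multiple of 8 bytes. The function name is mine, purely for illustration.

```python
import math

IP_HEADER = 20  # bytes; assumes an IPv4 header with no options


def fragment_count(packet_size, link_mtu):
    """How many IPv4 fragments a packet becomes on a link with a smaller MTU."""
    if link_mtu < IP_HEADER + 8:
        raise ValueError("link MTU too small to carry any fragment payload")
    if packet_size <= link_mtu:
        return 1  # fits as-is, nothing to split
    payload = packet_size - IP_HEADER
    # Each fragment repeats the header; its payload is rounded down to 8 bytes.
    per_fragment = (link_mtu - IP_HEADER) // 8 * 8
    return math.ceil(payload / per_fragment)


print(fragment_count(1500, 1428))
print(fragment_count(1428, 1428))
```

Under these assumptions, a full 1500-byte packet crossing a link that only carries 1428 bytes becomes two fragments, while a packet that was already 1428 bytes goes through untouched.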

So, anyway, if you set your MTU to a lower value (there are ways of figuring out the ideal number; in my case I set it to 1428), you increase the reliability of your network connection. In practice, I’ve seen dramatic improvements for previously unusable free wifi access points, like the one at the gym where my son goes for his nerf gun battles, or the one at the neighborhood pool.
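One common way of figuring out that number (not necessarily the one I used) is to send pings with the "don't fragment" flag set, find the largest payload that still gets through, and add 28 bytes for the IPv4 and ICMP headers. Here is a hedged Python sketch of that search. The function names are made up, the probe assumes the Linux iputils `ping` with its `-M do` option (other systems use different flags), and the search assumes that if one size gets through then every smaller size does too.

```python
import subprocess

IP_HEADER = 20    # bytes; IPv4 header without options
ICMP_HEADER = 8   # bytes; ICMP echo header


def largest_unfragmented_payload(probe, low=0, high=1472):
    """Binary search for the largest payload size for which probe(size) is True."""
    if not probe(low):
        raise ValueError("even the smallest probe failed")
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid        # mid got through, so the answer is at least mid
        else:
            high = mid - 1   # mid was too big
    return low


def mtu_from_payload(payload):
    """Convert the largest ping payload into an MTU value."""
    return payload + IP_HEADER + ICMP_HEADER


def linux_ping_probe(host):
    """Build a probe using Linux iputils ping (an assumption about your system)."""
    def probe(size):
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", "-M", "do", "-s", str(size), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0
    return probe


# Stand-in probe so the sketch runs without touching the network:
# it pretends anything up to 1400 bytes of payload gets through.
fake_probe = lambda size: size <= 1400
print(mtu_from_payload(largest_unfragmented_payload(fake_probe)))
```

To try it for real, pass `linux_ping_probe("some-host-you-can-ping")` in place of the fake probe. A single lost ping will throw the search off, so on a flaky connection it is worth running it a few times and comparing the results.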