filtering out gibberish

Apr 20, 09:53 PM

So, this is a bit of a long story. As a background, one of my projects is building a DSpace repository for 3D architectural models. These models are stored in a format that is unique to the tool the researchers use, which means they are big binary files. There is interesting text data in these files, so a full text index would be helpful for people searching through them all, to find something they might have missed. I mean, we’ll have good metadata for these files, but not all data is surfaced by even the best metadata. The typical approach for DSpace repositories is to convert the data formats to something DSpace can index, like a PDF, or a CSV file. But these data files are not the kind of thing you just convert. And no text extraction tool exists for this data format. Until today. :-)

I’ve been thinking about this problem for a few months. I’ve even run the files through the very handy strings program. Unfortunately, strings is very exuberant and pulls all kinds of gibberish out of a binary file. So… not very useful for a full text index. I shelved the idea. Until yesterday.

Yesterday, during a meeting, I had occasion to talk about a tool I’ve wanted to play with for a long time, the Data Science Toolkit. I pulled up the URL so I could talk meaningfully about it, but didn’t get to make more than a passing mention. After the meeting, before closing the tab I had open, I noticed the Text to Sentences API for the DSTK. I tinkered with it. It had never occurred to me that you could filter out gibberish and retrieve meaningful text. This was the missing piece! I poked around a bit, looking for code, but ran out of time to do any more investigation, so I instead shouted out to the void (i.e., I tweeted):


and, amazingly, I got a response:


So, I grabbed a copy of cruftstripper.rb, uncommented the last three lines, and renamed the script “sentences”. I pointed strings at a data file and piped that gibberish through my new “sentences” script, and voila:

strings big-data-file.3d | sentences

And got useful data, which I can’t share with you because I don’t have permission to do so. But you’ll have to trust me: it’s beautiful.

And the really cool thing about this approach is that it works for any mysterious data format a repository might receive.
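
For my own notes, here is a rough sketch of how this could be batched over a whole pile of files. The directory path is made up, and “sentences” is just my renamed copy of cruftstripper.rb somewhere on my PATH:

# Untested as written: run every file in a directory through strings and the
# gibberish filter, writing one plain-text extract per input file.
for f in /path/to/data-files/*; do
  strings "$f" | sentences > "$(basename "$f").txt"
done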

Now, what I have left to do is wire this mess up into a media filter script for DSpace, which will be pretty easy. The tricky bit will be figuring out a way to do it so that the code can be merged into the DSpace application by default. Because that’s what I really want to happen here: give DSpace the ability to produce a full text index from any file.


Server performance has been on my mind lately, specifically Solr server performance. We use Solr as our search and browse infrastructure for MOspace, our DSpace-based institutional repository. I’ve heard that search performance on MOspace has been slow, so I wanted to get some data on that, and then find ways to improve the performance. I discovered that our Solr log files contain information on how long it takes the server to respond to a query, which is great; I just needed a way to analyze that data. I’ve been meaning to play with Miller, a handy little text-parsing tool with a bit of data analysis added in. It’s a lovely tool, but I couldn’t figure out the syntax to do what I wanted. Normally, when I run into this kind of issue with open source software, I’ll reach out on a related mailing list and ask for help. But Miller is new; it doesn’t have a community yet, and no mailing list. So, I sent an e-mail to John Kerl, the developer of Miller. Astoundingly, John responded right away, with a little snippet that turned into this:

for f in solr.log*; do echo "Processing $f..."; grep search $f | sed 's/.*} //' | \
mlr --repifs --ifs space stats1 -a min,mode,mean,stddev,max -f QTime; echo; echo; done
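
In case the sed step looks cryptic: as best I recall the format, a Solr query log line ends with the request params in curly braces, followed by space-separated hits, status, and QTime fields, so stripping everything through the closing brace leaves exactly the key=value pairs that mlr --repifs --ifs space can parse. A made-up line to illustrate:

# Hypothetical log line; the exact prefix varies, but the tail is what matters here:
line='INFO org.apache.solr.core.SolrCore [search] webapp=/solr path=/select params={q=test&rows=10} hits=42 status=0 QTime=7'
echo "$line" | sed 's/.*} //'
# => hits=42 status=0 QTime=7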

That loop, run over our log files, outputs something like this:

Processing solr.log...
QTime_min=0.000000,QTime_mode=1,QTime_mean=8.794904,QTime_stddev=83.023710,QTime_max=6380.000000

Processing solr.log.2015-08-26...
QTime_min=0.000000,QTime_mode=3,QTime_mean=15.659905,QTime_stddev=87.355819,QTime_max=5350.000000

Processing solr.log.2015-08-27...
QTime_min=0.000000,QTime_mode=3,QTime_mean=10.005723,QTime_stddev=59.013046,QTime_max=2557.000000

Processing solr.log.2015-08-28...
QTime_min=0.000000,QTime_mode=3,QTime_mean=9.215227,QTime_stddev=51.547102,QTime_max=2692.000000

Processing solr.log.2015-08-29...
QTime_min=0.000000,QTime_mode=3,QTime_mean=8.626997,QTime_stddev=52.084964,QTime_max=2688.000000

Processing solr.log.2015-08-30...
QTime_min=0.000000,QTime_mode=3,QTime_mean=8.197308,QTime_stddev=50.787355,QTime_max=4718.000000

Processing solr.log.2015-08-31...
QTime_min=0.000000,QTime_mode=3,QTime_mean=8.253708,QTime_stddev=106.475814,QTime_max=9781.000000

Which is kinda cool. But those max response times! I googled around a bit and found this StackOverflow thread.

Which in turn led me to this wiki page and this wiki page.

That last page mentions these settings for tuning the GC to maximize Solr performance:

JVM_OPTS=" \
-XX:+UseG1GC \
-XX:+PerfDisableSharedMem \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=250 \
-XX:InitiatingHeapOccupancyPercent=75 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
"

This is info I can use. So, I tried those OPTS in our Tomcat server’s setenv.sh script:

export CATALINA_OPTS="$CATALINA_OPTS -XX:+UseG1GC"

# suggestion from http://stackoverflow.com/questions/5839359/java-lang-outofmemoryerror-gc-overhead-limit-exceeded
# and here https://plumbr.eu/outofmemoryerror/gc-overhead-limit-exceeded
# though, really, we ought to figure out where the memory leak is at some point...
export CATALINA_OPTS="$CATALINA_OPTS -XX:-UseGCOverheadLimit"

# This is interpreted as a hint to the garbage collector that pause times
# of <nnn> milliseconds or less are desired. The garbage collector will
# adjust the Java heap size and other garbage collection related parameters
# in an attempt to keep garbage collection pauses shorter than <nnn> milliseconds.
# http://java.sun.com/docs/hotspot/gc5.0/ergo5.html
export CATALINA_OPTS="$CATALINA_OPTS -XX:MaxGCPauseMillis=250"

# A hint to the virtual machine that it's desirable that not more than
# 1 / (1 + GCTimeRatio) of the application execution time be spent in
# the garbage collector.
# http://themindstorms.wordpress.com/2009/01/21/advanced-jvm-tuning-for-low-pause/
export CATALINA_OPTS="$CATALINA_OPTS -XX:GCTimeRatio=9"

# more tweaks for Solr performance optimization, from
# http://wiki.apache.org/solr/ShawnHeisey
export CATALINA_OPTS="$CATALINA_OPTS -XX:+PerfDisableSharedMem"
export CATALINA_OPTS="$CATALINA_OPTS -XX:+ParallelRefProcEnabled"
export CATALINA_OPTS="$CATALINA_OPTS -XX:G1HeapRegionSize=8m"
export CATALINA_OPTS="$CATALINA_OPTS -XX:InitiatingHeapOccupancyPercent=75"
export CATALINA_OPTS="$CATALINA_OPTS -XX:+UseLargePages"
export CATALINA_OPTS="$CATALINA_OPTS -XX:+AggressiveOpts"
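
Once Tomcat is restarted, a quick sanity check (slightly paranoid, but cheap) that the flags actually made it onto the Java command line, assuming a stock Tomcat whose main class is org.apache.catalina.startup.Bootstrap:

# Split the running Tomcat command line into one argument per line
# and keep only the -XX flags:
ps -ef | grep '[B]ootstrap' | tr ' ' '\n' | grep '^-XX'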

And… so far… so good. Watching our system profiler, I can see that we are having far fewer big GC events, and many more small ones.
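
If I want harder numbers than eyeballing the profiler, one option I haven’t wired up yet would be to turn on GC logging in the same setenv.sh and compare pause times before and after the switch. These are the pre-Java-9 HotSpot logging flags; I’m assuming that’s the JVM generation we’re on:

# Log every GC event with timestamps so pause times can be compared over time:
export CATALINA_OPTS="$CATALINA_OPTS -verbose:gc -XX:+PrintGCDetails"
export CATALINA_OPTS="$CATALINA_OPTS -XX:+PrintGCDateStamps"
export CATALINA_OPTS="$CATALINA_OPTS -Xloggc:$CATALINA_BASE/logs/gc.log"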

Today, I’ve been playing a bit more with Miller, specifically grouping and sorting, so I can get a more fine-grained picture of query response with the tuning noted above in place.

for f in solr.log; do echo "Processing $f..."; grep search $f | sed 's/.*} //' | \
mlr --repifs --ifs space stats1 -a min,mode,mean,stddev,max -f QTime -g hits then sort -nr hits; echo; echo; done

Which gets me lots of info like this (grouping by hits, reverse-sorting by hits):

hits=18383,QTime_min=5.000000,QTime_mode=5,QTime_mean=5.000000,QTime_stddev=,QTime_max=5.000000
hits=18283,QTime_min=3.000000,QTime_mode=8,QTime_mean=40.402174,QTime_stddev=42.019111,QTime_max=137.000000
hits=18282,QTime_min=2.000000,QTime_mode=6,QTime_mean=18.181592,QTime_stddev=41.169189,QTime_max=577.000000
hits=17403,QTime_min=6.000000,QTime_mode=184,QTime_mean=501.666667,QTime_stddev=816.713577,QTime_max=2068.000000
hits=17043,QTime_min=1.000000,QTime_mode=4,QTime_mean=3.711864,QTime_stddev=1.712538,QTime_max=9.000000
hits=17042,QTime_min=1.000000,QTime_mode=2,QTime_mean=2.278986,QTime_stddev=1.984960,QTime_max=29.000000
hits=17025,QTime_min=1.000000,QTime_mode=2,QTime_mean=3.378571,QTime_stddev=1.990871,QTime_max=11.000000
hits=17024,QTime_min=1.000000,QTime_mode=2,QTime_mean=2.060000,QTime_stddev=1.228379,QTime_max=14.000000
hits=15436,QTime_min=2.000000,QTime_mode=3,QTime_mean=20.684211,QTime_stddev=40.519906,QTime_max=235.000000
hits=14999,QTime_min=5.000000,QTime_mode=6,QTime_mean=8.500000,QTime_stddev=5.686241,QTime_max=17.000000
hits=14878,QTime_min=4.000000,QTime_mode=5,QTime_mean=18.888889,QTime_stddev=25.852681,QTime_max=78.000000
hits=14868,QTime_min=3.000000,QTime_mode=3,QTime_mean=3.750000,QTime_stddev=1.164965,QTime_max=6.000000
hits=11165,QTime_min=7.000000,QTime_mode=7,QTime_mean=490.000000,QTime_stddev=667.476142,QTime_max=1533.000000

I can then see that certain queries may be taking an unusually long time; the hits=17403 group, for instance, has a QTime mode of 184 ms against a minimum of 6 ms. I will see if I can find out what those queries are and why they’re taking so long.
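
When I do, my first step will probably be something like this: pull the raw log lines for that group, so I can see the actual params= field of the slow queries (17403 is just the hit count from the row above):

# Raw log lines for the suspicious group; the params={...} in each line is the query:
grep search solr.log | grep 'hits=17403 '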

Oh, I should note that John Kerl, the developer of Miller, is currently asking for people to make suggestions for his FAQ page. Miller is awesome; if you have a text parsing or data analysis task you think Miller might be just the tool for, post your challenge to that issue. Someone might help you answer your question.

10/28/2015 UPDATE: it was brought to my attention that the Lucene folks have posted a rather amusing warning against using the G1 garbage collector and, similarly, the Elasticsearch folks have done the same (though it’s likely just an echo of the Lucene warning). Now, I don’t know what to make of this, but I do know that search performance has measurably improved since switching to G1GC. For now, I think I will choose to quietly worry.