
This past June I went to Open Repositories 2017 in Brisbane, Australia, where I presented a workshop on how to get started with Ansible and Serverspec. The workshop slides and materials are available. It was a great experience, and I’m happy to report that I survived the virtual machine I brought for the workshop refusing to run on my own notebook, thanks to the help of my pal Kim Shepherd, who loaned me his notebook to run the machine. (It worked fine for everyone else.) When I returned from Australia, I rewrote the Vagrant configuration to make a sturdier VM for future workshops, calling it Workshop-o-matic. It is useful for anyone who wants to follow along with the workshop slides, so if you’re interested, please do.

This blog post is very tardy; I’m sorry about that. Immediately after the conference, I took my wife on a vacation around northern Australia. We spent a wonderful week driving around in a rented campervan. It was a great time, and it has been a non-stop whirlwind catching up with work, life, and everything else that piles up after a vacation and at the start of a new school year. So, anyway, enough excuses; on with this recap.

Sir Timothy Gowers gave the keynote, entitled “Perverse incentives: how the reward structures of academia are getting in the way of scholarly communication and good science.” There is a video recording. Incidentally, all the filmed conference sessions are also available (note that only the general session tracks in the main ballroom were recorded).

Two things Sir Timothy said really struck a chord with me, and they have held my attention through the conference and since:

The current culture doesn’t really favor sharing [incomplete ideas].

An obvious thought is that if we did all start sharing our little scribblings, we could end up with a complete mess.

After hearing this, my mind started racing, because we open source developers do exactly this: we already share our work in progress. And it hit me that I’d been thinking about this problem for a while. I even had a phrase for my half-baked idea of how to approach it:

Are you going to eat that, mate?

And I filled a page with scribbly notes, and found a team to pitch this idea as part of the Ideas Challenge. Alas, it didn’t win, but we had great fun making the slides.

So, my main takeaway from this conference is that I need to figure out how network data analysis works, and I need to tackle this challenge on my own, because I’m convinced there are a lot of really great ideas—almost finished code—just out there on GitHub, waiting for us to find, and ask that question: Hey, if you’re not using this code, can we use it?

Pardon this digression: after the conference, my pal Kim sent me a note on Slack to say he had been fiddling with a citation database and ended up finding this article:

Nora McDonald, Kelly Blincoe, Eva Petakovic, and Sean Goggins, “Modeling Distributed Collaboration on GitHub,” Advances in Complex Systems 17, no. 07n08 (December 2014): 1450024.

And an author name leapt out at me: Sean Goggins. Hey, I think I know that guy. We have friends in common, and we go to the same neighborhood pool. So we became Facebook friends. I still haven’t taken Sean out to lunch, but he’s working on a really interesting project:


From the governance page, the mission of the project is to:

  1. produce integrated, open source software for analyzing software development, and definition of standards and models used in that software in specific use cases;
  2. establish implementation-agnostic metrics for measuring community activity, contributions, and health; and
  3. optionally produce standardized metric exchange formats, detailed use cases, models, or recommendations to analyze specific issues in the industry/OSS world.

That’s not quite what I want to do, but it works with the same data set, to help foster the health of open source development communities. And goal 3 would at least help with my own goal, which is essentially to build a recommendation engine for work in progress on GitHub.

Now, back to my OR17 recap. Here are some of the cool tools I found out about at the conference, the things I want to check out later:

In Dev Track 2, Conal Tuohy presented on mining linked data from text, and mentioned a tool I want to check out: XProc, a W3C recommendation for an XML transformation language using XML pipelines. There’s a book and a tutorial I found; I’ll check them out later.

Also, Peter Sefton presented a static repository builder tool, Calcyte, which makes extremely high-performance and inexpensive data repositories with static HTML.

The real draw for me for this session was the presentation Visualizing Research Graph using Neo4j and Gephi by Dr. Amir Aryani and Hao Zhang. I knew after the keynote that I really needed to find out more about graph data, and I knew Neo4j and Gephi would be tools I’d need to be familiar with, so I ended up in Dev Track 2 to see this presentation and be inspired, and it did not disappoint. I came out of this presentation convinced that I could use the network graph data in GitHub to build the recommendation engine I wanted to build. And, even if I didn’t build a full-fledged tool, at a minimum I should be able to explore this data on my own, using Neo4j and Gephi.
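I haven’t built anything yet, but here’s a toy sketch of the kind of graph exploration I mean, with plain Python standing in for a real Neo4j graph. All the repository and developer names are made up for illustration:

```python
# Developers and the repos they have starred or forked form a graph;
# a naive recommendation surfaces repos favored by developers who
# share interests with you. Names here are invented, not real repos.
from collections import Counter

interest = {
    "alice": {"repo-a", "repo-b"},
    "bob":   {"repo-b", "repo-c"},
    "carol": {"repo-b", "repo-c", "repo-d"},
}

def recommend(user, graph):
    """Suggest repos starred by developers with overlapping interests."""
    mine = graph[user]
    scores = Counter()
    for other, repos in graph.items():
        if other == user:
            continue
        overlap = len(mine & repos)
        if overlap == 0:
            continue
        for repo in repos - mine:
            scores[repo] += overlap  # weight by shared interests
    return [repo for repo, _ in scores.most_common()]

print(recommend("alice", interest))  # ['repo-c', 'repo-d']
```

A real version would pull this graph from the GitHub API and load it into Neo4j, where the same neighbor-overlap query is a short Cypher traversal.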

I’m not ashamed to admit that I sought out Dr. Aryani’s next presentation on the following day, in General Track 10, “Research Graph: Building a Distributed Graph of Scholarly Works using Research Data Switchboard”. It was really interesting to find out how a distributed graph works, and why one would use it: it’s a way to produce a larger data set from more than one shared dataset, by connecting the graph data across disparate repositories. Doing so allows each partner institution to retain “ownership” of its own data, while still maintaining access to the shared whole of the larger dataset. Distributed graph databases also share a bit of the computational load of running large-scale queries, which helps the entire data set scale and remain usable.

My other takeaway from OR17 is that I really need to keep better tabs on a particular colleague of mine, Andrea Schweer, as she often puts interesting code up on GitHub, and every time I look at her code I’m blown away by its quality, and how I can immediately make use of much of it. Don’t believe me? Look at this collection of cool stuff.

That’s the kind of thing I hope to be able to find with my future fiddling with the GitHub network graph. How many other developers have huge collections of interesting bits of code, maybe just half-finished, but still amazingly useful, waiting for us to discover, and use, and build communities around?

I’m really excited about this, and hope to be able to help make it happen.

UPDATE (12/15/2017):
Just to give you a tiny taste of what’s possible, GitHub has added a couple of recommendation-engine features. If you have a GitHub account, head on over to GitHub Discover and GitHub Explore, which are both giant rabbit holes of fun. Happy hunting! NOTE: neither of these features is what I had in mind; they’re just basic “you like these projects and follow these people, have you seen this project?” or “hey, everyone else is excited about this, you should be, too” kinds of things. I’d like to focus on branches in forks of a project, find the ones that have been pulled a lot, and mix that in with other social data (friends of friends, etc.).
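To make that ranking idea concrete, here is a hedged sketch. Every field name, weight, and fork name below is hypothetical, not a real GitHub API response:

```python
# Rank branches in forks by pull count, boosted when the fork's owner
# is socially close to you (1 = friend, 2 = friend-of-friend, ...).
# Data and weights are illustrative only.
def score_branch(branch, social_distance, pull_weight=1.0, social_weight=5.0):
    """Higher score = more worth a look."""
    return pull_weight * branch["pulls"] + social_weight / social_distance

branches = [
    {"fork": "friend/project",   "branch": "fix-upload", "pulls": 40},
    {"fork": "stranger/project", "branch": "new-ui",     "pulls": 42},
]
distances = {"friend/project": 1, "stranger/project": 3}

ranked = sorted(branches,
                key=lambda b: score_branch(b, distances[b["fork"]]),
                reverse=True)
print(ranked[0]["fork"])  # friend/project ranks first despite fewer pulls
```

The weights would need tuning against real data, of course; the point is just that pull counts and social distance can be mixed into one score.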

UPDATE (01/16/2018):
Kim Shepherd wrote a fun song inspired by events at OR17. Two warnings: there’s a bit of NSFW language in the middle, and this probably makes more sense if you were there. But it’s a good song and in good fun, so give it a listen.

UPDATE (02/07/2018):
Ooooh, this looks fun

OR2016 Dublin Recap

Jun 20, 08:26 AM

My friend Adam Field tweeted this series of word clouds based on the tweets mentioning OR2016 Dublin. Whatever your feelings about word clouds might be, these are pretty spot-on as far as simple summaries of the experience, so I’m going to start off with these images:


This year, I was struck several times by the fact that I made it through the conference without feeling that my brain was mush. This year… I don’t know, maybe I’ve finally settled into this mental space… I suppose part of it is that I participated as a reviewer for OR16, and I ended up attending nearly everything I reviewed. So I had longer to digest what was being covered, and there were fewer real surprises for me.

There will be screencasts

But that’s not to say there were no surprises. Here’s one big one: I’ve now been utterly convinced that recording and publishing screencasts is no big deal. The important thing is to just start, to provide an index to each video, and to collect them all in a wiki (including the index with links to sections of the video). Just start, make useful videos, and keep doing it. Thanks, Adam.

Robots are seriously bad news for usage stats, and there are easy things we can do about that right now

The paper title was I Can Haz Robot, and it was about robot detection and filtering in usage stats. My initial reaction, before even skimming the abstract, was, “I think we’ve solved this problem?” Boy, was I wrong. The evidence is pretty damning, and was very thoroughly presented by Joseph W. Greene, from University College Dublin. There is lots of work to do here, but I should at least take an occasional look for the obvious robots (the top users/downloads reports are a great place to start) and then filter those out using the mechanisms provided by DSpace (or whatever platform you might use). It’s not enough, but it’s work that is worth doing. Joseph has an article in the works that will cover this topic in depth, scheduled for publication in July 2016 in the journal Library Hi Tech (deposited, but currently under embargo, in Research Repository UCD; the embargo lifts on 8/1/2016). Slides for this talk will likely be posted on the OR2016 site.
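As a rough illustration of that “occasional look for obvious robots” step, here is a minimal sketch. The patterns and log format are mine, not DSpace’s, and real filtering should lean on a maintained list such as the COUNTER robots list:

```python
# Drop hits whose user-agent matches an obvious crawler pattern
# before counting downloads. Patterns and log entries are illustrative.
BOT_PATTERNS = ("bot", "crawler", "spider")

def is_robot(user_agent):
    """True if the user-agent string looks like a known crawler."""
    ua = user_agent.lower()
    return any(pattern in ua for pattern in BOT_PATTERNS)

hits = [
    ("item/123", "Mozilla/5.0 (Windows NT 10.0)"),
    ("item/123", "Googlebot/2.1 (+http://www.google.com/bot.html)"),
    ("item/456", "Mozilla/5.0 (X11; Linux x86_64)"),
]

human_hits = [(item, ua) for item, ua in hits if not is_robot(ua)]
print(len(human_hits))  # 2
```

Substring matching like this only catches the polite robots that identify themselves; the paper’s point is that plenty don’t, which is why the problem is far from solved.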

Node.js is worth exploring (especially for simple tasks)

Jared Watts from the University of Auckland, New Zealand, co-presented a “Daring Demo” about microservices with my friend and fellow DSpace committer Kim Shepherd, and we sat together at the conference dinner. During dinner, Jared made the case for using Node.js for simple one-off tasks and projects, a task space for which I have been using Ruby the past few years, with the idea that it would be good for me to know Ruby better. I’m convinced: Node.js may help deliver on simple projects much faster than Ruby, especially after watching their discovery service demo.

Automated code review will make you look less dumb, and is worth checking out

I kept running into Jeremy Prevost from MIT, including at the Dublin airport on the way out of town. Our last chat was the most memorable: we talked about collaboration and technical debt, and Jeremy advised that I look into CodeClimate, an automated code review tool that checks your commits before you push them and gives you the chance to deal with a mistake before someone calls you on it. Sounds great to me; I’m just bumbling my way along here. :-) But, alas, CodeClimate doesn’t work with Java. However, Codacy does, and I intend to play with it.

Speaking of collaborators, here are a few of my favorites: a group photo of some of the DSpace committers

Pictured above, from left to right: Graham Triggs, Hardy Pottinger, Tim Donohue, Andrea Schweer, Pascal Nicolas Becker, Ivan Masár (aka Helix84), Andrea Bollini, Terry Brady, Kim Shepherd, and Richard Rodgers. Unfortunately the committers from @mire had left the room when we decided to take this picture, which makes me a little sad. Next year we’ll get them in the picture, I promise.

DSpace 6 is a little late and people don’t mind; DSpace 7 is going to be awesome

Tim Donohue, the Technical Lead for DSpace, brought this bit of news: several people had expressed relief that DSpace 6 has not yet been released. This counts as a minor surprise. Tim also demonstrated what will likely be the new UI in DSpace 7: ngUI (I think Richard Rodgers coined the name, but it fits; “ng” is what Angular 2 calls itself). The Angular 2-based extended prototype is a work in progress, being built in an agile way, using a Waffle.io board to manage work. I’m hearing a lot about Angular 2 (and Angular) from other developers (not just DSpace devs); I think basing the out-of-the-box UI for DSpace 7 on Angular 2 is a fantastic choice, and it will lead to some really fun repository experiences down the road. I’m looking forward to working with it. Here are the slides from Tim’s presentation on the new UI prototype; a video recording of one of these sessions should be available soon.

Vagrant Up!

We had a room change and I don’t think the recording equipment made it into the room, so there’s no recording of the Vagrant Up session I volunteered to chair. I blew past the 8 minutes slotted for my demo, which is part of why I’m so enamored with the idea of doing some screencasts… I want a do-over! Luckily, there were four other people in the room to take up my slack: Alicia Cozine from Curation Experts, Nick Ruest from York University, Liz Krznarich from ORCID, and Francis Kayiwa from Virginia Tech. I think between us all we covered all aspects of Vagrant, and a good bit of general provisioning concerns. I’m honored to have been able to present with these fine folks.

On the horizon: mix and match: annotations, IIIF integrations, re-usability and related services, machine interfaces to data

It’s pretty clear from the presentations this year that repositories have moved past the “Data is coming! Get ready for the data!” stage to the “Let’s do something interesting with all this data!” stage. Stanford is blazing the trail, which does not count as a surprise to anyone, and I found their paper, Value-added services to garner repository adoption, to be way more thrilling than its title suggests. Honestly, they are doing amazing stuff, and I want to swipe all their code. They have presented on Spotlight, their exhibit-building service, at past ORs. New to me is their Embed Service, which lets them turn over the exhibit building to other sites/services, and focus on just delivering bitstreams with an embedded viewer. It’s really slick, and I intend to play with it. This presentation wasn’t the only place I heard the Rufus Pollock quote “The best thing to do with your data will be thought of by someone else,” but it was the first place I heard it this year. Jack Reed, from Stanford, added “Don’t just create an open repository, let’s build open services around it.” One of the coolest things I saw in this presentation is how the data set in their repository was feeding into all kinds of related projects. And the derivative visualizations could rebuild themselves when a new version of the data was loaded. That’s the sort of thing that makes my head tingle a bit.

Speaking of head tingling, Dr. Peter Sefton from University of Technology, Sydney, Australia, presented on this amazing Data Arena they built, which is a platform for storing and displaying data visualizations. It’s a really cool mash-up of version control, data science, repositories and Hollywood.

Peter also presented on Ozmeka which is a fork of Omeka, re-tooled to function more like a repository platform. In a quick demo, he showed off a really slick way to import data into this repository via a CSV, which is pretty standard. What was really interesting was that he included in this CSV file an outline of the data model (collections, etc.) to fit the Portland Common Data Model, and then showed how this same CSV file could also feed into a Fedora 4 repository. It’s a useful concept, and one DSpace users might be able to borrow, since the PCDM is roughly the same as the DSpace data model.
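As a sketch of that concept (the column names here are my own invention, not Ozmeka’s actual format): a single CSV can carry both the records and an outline of the data model, with a parent column tying items to their collections:

```python
# Parse a CSV whose rows declare both collections and items; a
# hypothetical "parent" column expresses the PCDM-style membership
# structure. Column names and data are illustrative only.
import csv
import io

CSV_TEXT = """id,type,parent,title
coll1,collection,,Field Recordings
item1,item,coll1,Recording 1
item2,item,coll1,Recording 2
"""

tree = {}
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    if row["type"] == "collection":
        tree[row["id"]] = {"title": row["title"], "members": []}
    else:
        tree[row["parent"]]["members"].append(row["id"])

print(tree["coll1"]["members"])  # ['item1', 'item2']
```

Because the structure travels with the data, the same file could in principle feed any platform that understands a collection/item hierarchy, which is what made the Fedora 4 demo so interesting.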

OK, that’s a quick first draft of my thoughts. I reserve the right to make additions and changes, and I plan to add more links to presentations as time goes by. If video recordings are posted, I’ll link them here, too.

I’m especially interested in watching the opening and closing keynotes again. I have notes, but I want to re-watch them before I try to say anything about them.

Hmmm… maybe my brain really is mushy, I just haven’t noticed yet?

Ideas Challenge

I almost completely forgot: Adam Field was relentless in recruiting entrants for the Ideas Challenge, and I did join a team, with Grant Denkinson, from the University of Leicester, UK, and Roeland Dillen, from Atmire. We didn’t win, but I think our idea, which includes the notion of a post-ingest workflow, will end up being part of DSpace (see DS-3247), because it’s a pretty great idea… Getting stuff into the repository is most definitely not the end of the process.

Other recaps

As I find them, I’ll add links to other recaps of OR2016 here. First up is George Macgregor’s ‘EPIC’ blog post… George went to a few of the same sessions I did, and he has written a very detailed analysis, backed up with lots of links.


Videos from OR2016 are now getting posted, I’ll link to a few sessions I attended which I think are worth checking out.

  • Ozmeka, a repository before breakfast. I mentioned this one above as well… it’s a cool little “scratch pad” repository application, and a nice demo of how to transform a simple CSV file into something far more complex with linked data.

I’m hoping the keynotes from Laura Czerniewicz and Rufus Pollock will be posted soon; they were both wonderful and thought-provoking, and I think they deserve a much wider audience. I will link to them as soon as they are available.