
In this series, we dive into the world of the Midwest Research Computing and Data Consortium. We explore its members, their challenges, and prospects. Recently, we had the opportunity to connect with Geoffrey Lentner, Lead Research Data Scientist at the Rosen Center for Advanced Computing at Purdue University. Geoffrey was a speaker at the recent Annual Meeting, and he shares his insights on the field in the excerpts below:
I have been computer savvy since I was young. My dad worked on computers in the military, so I was no stranger to computers, but I wasn’t really a programmer. In graduate school I got really into computation and writing code. Most of my fellow astrophysics graduate students were not as excited. They felt like it was a bait and switch, but you can’t do modern science without computation. It’s all data. It’s all computation. I loved it. I spent every waking minute in graduate school doing nothing but computation.
As I was nearing the end of my PhD at Notre Dame, I decided that research had been the most interesting part for me. At the same time data science as a term was really taking off like a rocket ship, so I jumped on. Then Purdue, my undergraduate alma mater, offered me a position to be a data science person in their high-performance computing group and it felt like a job had been built for me in my hometown. I’ve been here 8 years. I have a research science background and I’m passionate about software. I do as much as I can with things like the Better Scientific Software foundation because my own experience was kind of a mixed bag. Some professors were very reluctant to let graduate students monkey too much with the code because they saw the code as a means to an end instead of seeing it as part of the science. Ten years later the scientific community has come around to the idea that science is suffering because of bad software. Good software is good science, and good science is good software—the two are interconnected. I’m passionate about systems, software, and science. I am fortunate that my work is the nexus of those three things. It’s the perfect gig!
The Rosen Center at Purdue is in a transitional stage as we work to be a national computing center. We handle a mix of campus computing needs and those associated with being a national resource provider. I’m on the applications team. Everyone on my team has a background in an area of science that is computational in nature, and now we work as facilitators. If I double click on that word facilitator, my job splits into five different directions.
Consulting: I talk to all the new faculty. I’m the lead campus-facing facilitator. I have an outreach campaign every fall when I meet all the new faculty and make sure they are aware of all the cyberinfrastructure that we have available. I advise faculty on good choices where the software hits the metal. It’s part science, part software, part systems. We’ve got to span all of that, so I support them on best practices.
Data management: My specific job title is data scientist. Even though my background was originally in traditional high-performance computing working with large-scale simulation in languages like C, my contemporary and industry experience is in data science. That means workflow engineering, data processing and analysis, and machine learning.
Contract work: It could be from a grant, or funds that a research group has available. If they have something they want to build, and the graduate students don’t have the time or the expertise to quite get it home, they’ll partner with us for a contracted period.
Outreach engagement and training: I give a lot of guest lectures and invited talks. I help on campus with outreach events, hackathons, and similar engagement opportunities. I’m very engaged in the community and hold the chair position on several national committees.
Applications: The last 20% of my job is all applications. I’m responsible for many of the applications and frameworks on the supercomputer, working more on the system side with configuration.
Unlike traditional high-performance computing where a group of collaborators writes the whole code that does a big simulation on a supercomputer, more often these days, a lot of people are using supercomputing for workflows. It’s data processing. It’s not one big thing. It’s orchestrating a lot of little things. What we call ‘throughput computing’ is defined by how many things you’re processing and not by how many calculations you’re doing. I like to call it many-task computing.
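The many-task idea described above can be sketched with nothing more than a worker pool draining a queue of small, independent jobs. This is a hypothetical illustration of the pattern only, not HyperShell itself; all names here are made up:

```python
# A minimal sketch of many-task ("throughput") computing: throughput is
# measured by how many independent tasks complete, not by the size of
# any single calculation. Names are illustrative, not from HyperShell.
from concurrent.futures import ThreadPoolExecutor

def run_task(task: str) -> str:
    """Stand-in for launching one small job (e.g., a shell command)."""
    return f"done: {task}"

def process_all(tasks: list[str], max_workers: int = 4) -> list[str]:
    # Each task is independent, so a simple pool drains the whole list;
    # orchestration, not any one computation, is the hard part.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_task, tasks))

results = process_all([f"task-{i}" for i in range(8)])
```

In a real setting each task would be a separate process or batch job rather than a thread, but the shape is the same: many little things, orchestrated.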
The lightning talk I gave at the annual meeting was on HyperShell, which is a user-facing tool for high-throughput scheduling of small jobs. Essentially, there is a large community of users in a few different areas, such as bioinformatics or astronomy, for example, who are trying to perform something rather modest in size. They just have a lot of these jobs to do and need to orchestrate all the pieces.
About 5 years ago we were working on a contract project where the group needed to send little tasks from edge computers that were taking pictures and then processing them with an AI model. The model only ran under Windows, and it needed to be Linux for the supercomputer. So basically, I needed to bring up a cluster of Windows virtual machines that would accept little bits of work over a queue I wired up from inside the VMs to Raspberry Pis running in a conference room. The operation needed to be real-time and nothing out there existed to do it. It wasn’t hard. You just needed a little bit of Python code to make it work.
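The setup described above, edge devices pushing small units of work onto a queue that workers drain, can be sketched in a few lines of Python. This in-process version uses threads as stand-ins for the Raspberry Pis (producers) and the Windows VMs (consumers); the real system would carry the queue over the network:

```python
# Producer/consumer sketch of the edge-to-VM pipeline described above.
# Threads stand in for the Raspberry Pis (producers) and Windows VMs
# (consumers); a real deployment would put the queue on the network.
import queue
import threading

work: "queue.Queue[str | None]" = queue.Queue()
results: list[str] = []

def edge_device(n_images: int) -> None:
    # A Raspberry Pi takes pictures and enqueues them for processing.
    for i in range(n_images):
        work.put(f"image-{i}")
    work.put(None)  # sentinel: no more work is coming

def vm_worker() -> None:
    # A VM pulls items off the queue and runs the model on each one.
    while (item := work.get()) is not None:
        results.append(f"processed {item}")

producer = threading.Thread(target=edge_device, args=(3,))
consumer = threading.Thread(target=vm_worker)
producer.start(); consumer.start()
producer.join(); consumer.join()
```

The queue decouples the two sides, which is what makes the pattern real-time friendly: producers never wait on the model, and consumers never wait on the camera.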
The process worked well, and the research group was thrilled with it. Over the course of a few years, there were more occasions where something similar was needed to meet the needs of other research groups. Over time it just evolved. We made the choice that this was worth doing and went back to the drawing board to build it from scratch the right way. With HyperShell, the user gets their own private scheduler and a lot more throughput. It’s also got a rich feature set that lets researchers track their data. For example, if you are a bioinformatics person with a lot of sequence analyses to run, you can tag them. You can then look up a sequence with specific genes that you ran 5 months ago.
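The tag-and-lookup workflow mentioned above might look something like the following sketch. This is an illustrative in-memory model of the idea, not the actual HyperShell interface; every class and field name here is invented:

```python
# Illustrative sketch of tag-based task tracking: each submitted task
# carries arbitrary tags, and finished tasks can be queried by tag
# later. This models the idea only; it is not the HyperShell API.
from dataclasses import dataclass, field

@dataclass
class Task:
    command: str
    tags: dict[str, str] = field(default_factory=dict)

class TaskLog:
    def __init__(self) -> None:
        self._tasks: list[Task] = []

    def record(self, task: Task) -> None:
        self._tasks.append(task)

    def search(self, **tags: str) -> list[Task]:
        # Return every task whose tags include all the given pairs.
        return [t for t in self._tasks
                if all(t.tags.get(k) == v for k, v in tags.items())]

log = TaskLog()
log.record(Task("align sample1.fasta", tags={"gene": "BRCA1", "run": "2024-05"}))
log.record(Task("align sample2.fasta", tags={"gene": "TP53", "run": "2024-05"}))

# Months later: find every run that touched a specific gene.
hits = log.search(gene="BRCA1")
```

A real scheduler would persist this log to a database, but the payoff is the same: months later, a researcher can pull up exactly the runs that touched a given gene.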
The first piece of advice is very high level, for young people who are still in college. They can be concerned that if they don’t get the perfect opportunity at a critical moment then their whole career, their whole professional life, is just over, and that’s just not how it is. If you don’t get into the specific graduate school, internship, or role in the company that you want, there is always tomorrow. There’s always the next thing. I switched from English to physics, and then went to industry, and then came back. The reality is, I love the job that I do. I can’t imagine doing anything with more impact, or anything more fun. You never know where you’re going to go. If you don’t get the perfect thing right out of the gate, there’s always something around the corner, and it doesn’t mean you can’t try again.
Second is more specific advice for those looking to get into research computing and data. You know, people in my position are still trying to figure out the right professional track to create a pipeline of new workers. I don’t think anybody really agrees, but honestly, I’m happy with how my colleagues and I got into this field. Pick something that is research-driven where computing is involved. This could be in science, in engineering, even in the humanities. For example, I consulted with somebody doing deep learning for archaeology and it’s really interesting. Also, I think what is sorely lacking in a lot of young people is the willingness to dig into system-level stuff. You can just do things. You can learn how the system works. You can dig deep into Unix. You can learn more than one programming language. People are always asking me what programming language to learn. You should learn all of them! I know seven because they’re all interesting, and they all offer something unique. Not every project requires the same thing. Everything I know that is important and valuable includes things I did not learn in a class. Just go learn stuff!