Show HN: COBOL-REKT, a toolkit for analysing and reverse-engineering COBOL (github.com/avishek-sen-gupta)
91 points by armorer on Aug 15, 2024 | 49 comments
This is an evolving toolkit of capabilities helpful for analysing and reverse engineering legacy Cobol code. Currently, the following capabilities are available:

- Program / Section-level flowchart generation based on AST (SVG or PNG)
- Parse Tree generation (with export to JSON)
- Control Flow Tree generation (with export to JSON)
- Allows embedding code comments as comment nodes in the graph
- The SMOJOL Interpreter (WIP)
- Injecting AST and Control Flow into Neo4J
- Injecting Cobol data layouts from Data Division into Neo4J (with dependencies like MOVE, COMPUTE, etc.) + export to JSON
- Injecting execution traces from the SMOJOL interpreter into Neo4J
- Integration with OpenAI GPT to summarise nodes using bottom-up node traversal (AST nodes or Data Structure nodes)
- Exposes a unified model (AST, CFG, Data Structures with appropriate interconnections) which can be analysed through [JGraphT](https://jgrapht.org/), together with export to GraphML format and JSON.
- Support for namespaces to allow unique addressing of (possibly same) graphs
- ALPHA: Support for building Glossary of Variables from data structures using LLMs
- ALPHA: Support for extracting Capability Graph from paragraphs of a program using LLMs
- ALPHA: Injecting inter-program dependencies into Neo4J (with export to JSON)
- ALPHA: Paragraph similarity map
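
For anyone wondering what "bottom-up node traversal" means for the summarisation item, here is a rough Python sketch of the general idea. It is not the toolkit's actual code: the `Node` class, the node labels and the `stub_llm` stand-in are all assumptions made up for illustration, and no real LLM is called.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical tree node; stands in for an AST or data structure node.
    label: str
    children: list = field(default_factory=list)
    summary: str = ""


def summarise(node, summarise_fn):
    """Post-order walk: summarise every child first, then the parent,
    so the parent's summary can be built from its children's summaries."""
    child_summaries = [summarise(child, summarise_fn) for child in node.children]
    node.summary = summarise_fn(node.label, child_summaries)
    return node.summary


def stub_llm(label, child_summaries):
    """Stand-in for an LLM call; it only concatenates, to keep the sketch offline."""
    if not child_summaries:
        return label
    return f"{label}({', '.join(child_summaries)})"


tree = Node("SECTION-MAIN", [
    Node("PARA-INIT", [Node("MOVE"), Node("OPEN")]),
    Node("PARA-LOOP", [Node("READ")]),
])
print(summarise(tree, stub_llm))
# SECTION-MAIN(PARA-INIT(MOVE, OPEN), PARA-LOOP(READ))
```

The point of going leaves-first is that each call only needs the node's own text plus its children's already-condensed summaries, rather than the whole subtree.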

Contributions / use cases are welcome!



There’s actually a lot of academic work around this from the 1990s; static analysis, reverse engineering, business logic extraction, re-engineering. All leading up to Y2K. There were quite a few commercial applications too. That all fizzled out after January 1, 2000 though.


I'm hoping it makes a comeback in the lead up to Y2038, heh


Here's a crazy idea (and possibly a job opportunity for someone?)

If someone built a tool to translate the AST generated by this into one of these newer theorem-proving dependently-typed languages (examples: Idris/Idris2 come to mind, but also the Coq/Rocq theorem prover, Agda, Lean), would it be theoretically possible to not only translate this code into a newer language but also suss out bugs and literally prove correctness? (Given how important some of this COBOL code seems to be, such as at Medicare)

I know that one of the risks of changing the language that logic and computation are written in is unexpectedly changing the behavior or introducing new bugs; I'm wondering if this might mitigate or almost entirely prevent that.


Fully automated? I don't think so, and it falls apart surprisingly quickly. To give an example:

All variables (well, in "classical COBOL" at least) are global variables. Since memory is constrained, a really common idiom is to have great swathes of punned, overlaid variables; in C terms it would be unioned structs. Subroutine A would have a var A-TEMPS divided into A-TEMP-VAR1 and A-TEMP-VAR2. Since routine B isn't on the same call path, that _same area_ could also be divided into B-TEMPS, B-TEMP-X, B-RESULT, and C-MOVEIN (because hey, code got changed). When you port this mess to Java you can (a) emulate unions with mind-bogglingly complex, huge, and fragile idioms, (b) tease out the actual code flow graphs and intent, or (c) some combination.
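
To make option (a) concrete, here is a minimal sketch (in Python rather than Java, to keep it short) of what emulating one overlaid area looks like. The field names come from the example above; the byte widths, the offsets and the `Overlay` helper are made up for illustration.

```python
class Overlay:
    """One shared storage area, viewed under several hypothetical layouts."""

    def __init__(self, size, layouts):
        self.storage = bytearray(b" " * size)  # space-filled, COBOL style
        # layouts: {field_name: (offset, length)}; the widths are assumptions
        self.layouts = layouts

    def get(self, field):
        offset, length = self.layouts[field]
        return self.storage[offset:offset + length].decode("ascii")

    def set(self, field, value):
        offset, length = self.layouts[field]
        # Left-justify, then pad or truncate to the field width,
        # roughly like an alphanumeric MOVE
        data = value.ljust(length)[:length].encode("ascii")
        self.storage[offset:offset + length] = data


area = Overlay(8, {
    # Routine A's view of the 8 bytes
    "A-TEMP-VAR1": (0, 4),
    "A-TEMP-VAR2": (4, 4),
    # Routine B's view of the same 8 bytes
    "B-TEMP-X": (0, 2),
    "B-RESULT": (2, 6),
})

area.set("A-TEMP-VAR1", "1234")
area.set("A-TEMP-VAR2", "5678")
area.set("B-TEMP-X", "ZZ")        # silently clobbers half of A-TEMP-VAR1
print(area.get("A-TEMP-VAR1"))    # ZZ34
print(area.get("B-RESULT"))       # 345678
```

Every field access now goes through an offset table, and nothing in the types tells you that writing B-TEMP-X corrupts A-TEMP-VAR1. Multiply that by a few thousand fields and you get the fragility I mean.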

And no, automated doesn't go very far for (b); although the computer might be able to figure out that March doesn't have a leap day and so this path won't execute because of that, it _is_ the end of a quarter and so has an artificial closing-day tacked on _if_ this is for subsidiary Foo, but not Bar because they have a 0-day on the following month. Too many combinations to exhaustively compute, and requires a lot of human smarts to prune possibilities.


It's not a crazy idea. For example, Amazon's BluAge offering does automatic translation. However, frequently, forward engineering teams do not want a 1:1 translation of the code, because that might end up reproducing the same system organisation of the original Cobol base (in a modern language), and engineers/architects usually want to work on a new design (while maintaining the original domain logic). So far, this library does not step into the forward engineering territory, and tries to merely provide useful information/artifacts, which could help the reverse engineering teams move (hopefully) faster, but obviously there is a lot of experimentation / inference that can be done on top of the extracted data.


Depends - sometimes/often the goal is just to bring it into a modern language to improve maintainability (i.e. it's an easier hire) rather than to do a ground-up rewrite.

Especially as ground-up rewrites are often risky, and bringing it into a modern language might make it easier to incrementally improve/refactor over time.


When people say "bring this cobol code to a new language to improve maintainability", they don't just mean the syntax. Any C developer can learn cobol syntax in 10 minutes. They mean things like use functions instead of gotos, don't use just global variables, don't depend on lots of weird tooling from the parallel world of IBM mainframes, with most of your logic hidden in weird batch scripts with names like ABC@@DEF.

I could write a cobol to c translator in a weekend. Nobody would buy it. Source: I've spent the last year and a half consulting on a huge project to rewrite a cobol codebase.


So basically the real work is not syntax translation (which a machine could do, albeit probably poorly) but semantics translation (which requires a deep understanding of both languages as well as what the code's intent is), so that you can do things like replace gotos with function calls without breaking expected behavior given inputs.
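
One concrete case, as a rough sketch: in COBOL, jumping to a paragraph with GO TO runs that paragraph and then falls through into whatever paragraphs follow it, whereas a PERFORM of a single paragraph runs just that one and returns. Translating every paragraph into a function and every jump into a call gets the PERFORM case right and the GO TO case wrong. The paragraph names and the tiny "runtime" below are hypothetical, purely for illustration.

```python
# Hypothetical three-paragraph program; each paragraph just records that it ran.
def para_a(log):
    log.append("A")


def para_b(log):
    log.append("B")


def para_c(log):
    log.append("C")


# Source order matters: it defines what "falls through" to what.
PARAGRAPHS = [("PARA-A", para_a), ("PARA-B", para_b), ("PARA-C", para_c)]


def run_from(name):
    """GO TO-like entry: start at `name`, then fall through every
    following paragraph until the end of the program."""
    log = []
    names = [n for n, _ in PARAGRAPHS]
    for _, body in PARAGRAPHS[names.index(name):]:
        body(log)
    return log


def naive_call(name):
    """Naive translation: treat the paragraph as a function and call only it."""
    log = []
    dict(PARAGRAPHS)[name](log)
    return log


print(run_from("PARA-B"))    # ['B', 'C']
print(naive_call("PARA-B"))  # ['B']
```

So the translator has to know how each paragraph is actually reached before it can decide where the function boundaries go.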

A similar problem to translating procedural code in a language with mutable variables to functional code in a language with immutable variables. A lot of old functions in C etc. were expected to modify their passed-in arguments, for example, which would be a no-no today (note: not in the C space, I'm currently an Elixir dev, but I'm hoping that's now frowned upon!)

I've noticed that LLMs are still not very good at this, too.

I'm unfamiliar with COBOL but I'm currently looking for work; not sure if you'd be up for a conversation just to discuss your work since (for some odd reason) I enjoy refactoring (as well as software preservation and validation); at the very least I'd probably be a decent rubber-duck if you got stuck on something lol


Reading COBOL and translating its behavior 1:1 is maybe 1% of my job. There is so much more to modernising a bank. Technologically and culturally.

I can discuss my work, but unless you have a work permit in Israel, I won't be able to hire you.


Ha, now I get the username!

Without getting into specifics, is it rewarding work? I can probably not do terrible at the 99% part


To get the username you'd need to be familiar with at least the old Hebrew translation of Lord of the Rings and my mother :)

I'm enjoying tech-leading this project very much. I get to spend the time needed to write unusually high quality code, and there's a wide funnel that collects interesting debugging challenges and gets the best ones to my desk.

Working with a bank is... not for everyone. Not for me ten years ago, for example.


Is there any approach other than extensively documenting both the intent and inner workings of the system and then doing ground up rewrite?


There are a hundred approaches for as many big banks with mainframe cores. The problem of modernizing a bank is much, much bigger than what language its core logic is written in.

We're talking about a custom software stack grown from a first version written fifty years ago, before the word "coupling" meant anything to programmers.


Is the project you’re on a rewrite or a rehosting? Do you think an open source cobol jcl cics rehosting platform could gain traction?


Rewrite.

I don't think I know anyone that would buy one, paying IBM is not the main issue with mainframes, and having support with big pockets is nice.

But I only work in one corner of the industry.


There was a product in the 90s that ran on the PC and did static analysis on COBOL programs. I can't remember the exact name of it, something like renew-something-or-other. It had a query language where you could follow either the possible control flow or data flow from one point to others (or to a point from earlier ones).

The only thing I've used like it recently was OQL (Object Query Language) for querying the Java heap.

I remember Intellij had some static dataflow analysis and I do miss it working in RubyMine.
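
The kind of query described above (follow control flow forward from a point, or backward to a point from earlier ones) boils down to graph reachability. A minimal sketch, assuming a made-up control flow graph; the node names are hypothetical and this is not how that product or any specific tool implements it:

```python
from collections import deque

# Hypothetical control flow graph: node -> nodes control can go to next.
CFG = {
    "READ-INPUT": ["VALIDATE"],
    "VALIDATE": ["CALC-INTEREST", "REPORT-ERROR"],
    "CALC-INTEREST": ["WRITE-OUTPUT"],
    "REPORT-ERROR": ["WRITE-OUTPUT"],
    "WRITE-OUTPUT": [],
}


def reachable(graph, start):
    """All nodes reachable from `start` (forward query), via breadth-first search."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(graph):
    """Flip every edge so the same search answers 'what can lead here?'."""
    flipped = {node: [] for node in graph}
    for node, targets in graph.items():
        for target in targets:
            flipped.setdefault(target, []).append(node)
    return flipped


print(sorted(reachable(CFG, "VALIDATE")))
# ['CALC-INTEREST', 'REPORT-ERROR', 'WRITE-OUTPUT']
print(sorted(reachable(reverse(CFG), "CALC-INTEREST")))
# ['READ-INPUT', 'VALIDATE']
```

A data flow query is the same search over a different edge set (for example, "value moved from X to Y" instead of "control passes from X to Y").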


How about compiling Cobol to machine code, and then using an LLM to decompile to <your source language of choice>?

This moves the focus from Cobol specific tools to Cobol agnostic tools.


Not sure what compiling it first gets you, but IBM had basically the same idea:

https://news.ycombinator.com/item?id=38508250


Kinda surprised this isn't an IBM tool. I suspect they could make a killing consulting/watsonXing with this.


So obviously there’s been a lot of legacy COBOL kicking around, but is this still the case? Would a new COBOL project have been started in the last 20 years? I kind of imagined that Java (or at least the JVM) has eaten its lunch.


A number of US federal agencies still have astonishing amounts of it. The world’s largest insurer, Medicare, uses 10M+ lines of COBOL to process the claims it receives — total dollar amounts that make up 3% of the yearly GDP.

Maintaining and modernizing these critical systems is important work.


My (non-bank) employer still has lots of COBOL in production, and it's constantly extended. Before working here I expected all COBOL to be running on some large IBM mainframe, but no. It's x86-64 Windows -- COBOL compiled to native Windows binaries.

Edited to specify not a bank.


As someone doing support and occasional code changes for a pile of vb6 that doesn't sound that bad. If you need a code base to be stable for decades COBOL beats vb6.


Is there some advantage to using COBOL or did they just have an old COBOL programmer who just kept using it?


I'm not privy to the details as a sysadmin, but based on observation it's either an application requirement or a legacy foundational codebase carried forward -- probably a combination of both.


Every single attempt to replace COBOL has ended in failure. COBOL will never die. You can move it to new platforms, but it can never be replaced.


That's not really true, lots of places replace it with newer stuff.


Yes, there's a lot of it kicking around. This evolved from a personal testbed to iterate on ideas to apply on actual legacy code modernisation work I've been involved in.


So, is the color of your bank's logo blue, green, red or orange?


I worked for two very big banks in Europe and we had lots of COBOL batch jobs that were written like 30 or 40 years ago.


How many objects without source did you find there?

In the 80s (and probably before), every system I worked on had at least one critical COBOL program missing source code.

I am wondering if you noticed the same with this project.


I did some contracting for a retail bank a good while back and the COBOL source for the term deposit interest calculation routine had been lost a few decades earlier but was still in use. I suggested a rewrite but there was no enthusiasm/support for it - especially as my prototype could not exactly reproduce the rounding. Nobody wanted to have to deal with customer communications to explain why one cent of interest was being paid a month earlier or later than before.
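
For anyone wondering how a one-cent difference shows up: COBOL truncates an arithmetic result when the ROUNDED phrase is omitted, while a rewrite in another language will usually round somehow by default. The sketch below is not the lost routine (nobody knows what that did, which was the whole problem); the formula, balance and rate are made up so that the raw result lands exactly on a half cent.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP, ROUND_HALF_EVEN

CENT = Decimal("0.01")


def monthly_interest(balance, annual_rate, rounding):
    """Hypothetical formula: simple interest for one month, rounded to cents."""
    raw = balance * annual_rate / Decimal(12)
    return raw.quantize(CENT, rounding=rounding)


balance = Decimal("1000.00")
rate = Decimal("0.04518")  # made-up rate; raw monthly interest is exactly 3.765

print(monthly_interest(balance, rate, ROUND_DOWN))       # 3.76 (truncation)
print(monthly_interest(balance, rate, ROUND_HALF_UP))    # 3.77
print(monthly_interest(balance, rate, ROUND_HALF_EVEN))  # 3.76 (banker's rounding)
```

And that is only the final rounding step. The precision of intermediate results matters just as much, and without the source you can only guess at it from the outputs.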


> suggested a rewrite but there was no enthusiasm/support for it

Been there :)

Yes, you always get a "no" for doing that.


Ahaha I'm also working with a bank and "customer communications" is a really nice euphemism for "barrage of class action suits". I'm sitting here in the office laughing out loud.


Missing source code? That's terrifying.


Par for the course.


Totally normal, happened all the time :))


Our shop (a bank) still develops in COBOL on a daily basis (IBM mainframe) and has a stock of tens of millions of lines of code.


Converting to Lithp any day now...


I doubt many new projects get started in it, but there's still plenty out there even outside of banking, insurance etc - a guy I used to work with was a PHP developer and at a previous job he'd had to modify COBOL for a company that sold carpet.


Yes?

Any "let's move this COBOL to something else" will, more often than not, flounder on the fact that "rewrite all this functionality" will cost massively more than "just find someone to extend what we already have".



Places that have cobol are the same places that have a bunch of laws as their list of requirements, and nobody stopped writing laws in those 20 years. Those are also the places where risk is supposed to be in a different department.


There are probably tens of thousands of active COBOL developers maintaining systems written no later than I guess the nineties.


Absolutely, we have several full time people who specialize in COBOL variants for VMS and OS2200.


[flagged]


I really don't understand comments like this.

Someone spends time and effort building something they find useful and publishes it for free. But because they integrated an objectively useful tool that isn't perfect, people feel the need to point that one thing out.

If it's not for you, it's not for you. If you don't like that feature, fork the repo and create your own version without it. Or just ignore it and move on.


It's a short comment, so it's easily misunderstood. He probably means that, due to compliance, security, data-leak risks and so on, it would never be greenlit for use, because COBOL lives mostly in the financial industry and other heavily relied-upon infrastructure. You could, for example, just get advice from ChatGPT while coding away, but that wouldn't be allowed because of the proprietary, secret nature of the code and the data. Even internal coding customs, if a hypothetical enemy came to understand them through leaked prompts containing important data that was used to train a newer model, raise the risk of being attacked in a more sophisticated manner. So while it's awesome that tools are starting to be built, and I wished there were more of them while I was working on the same thing, there is a whole lot of analysis on whether a tool is even allowed to be used while staying compliant with all of the various regulations that have to be followed, in working style as well as in security and data protection.


Sure, we can all do some interpreting on their comment which fits our narratives, but that's missing the point.

If they made the exact comment you did, it would've been a good addition to a discussion about the usability of the provided tool. They didn't do that. They just wrote a single short sentence dismissing all of the work of the creator of the application for some reason that wasn't actually explained. Which is why I don't understand that type of comment.


s/OpenAI/automated imperfect subject matter expert/g



