I didn't really use my skills to guide it, beyond knowing what test suite exists and knowing it could reliably port a good HTML/CSS parser (to golang).
I burned through a week of codex max subscription quota with it making slow progress. Slower than I wanted, but consistent. Not long after, though, it hit a brick wall, and the second week+ of quota did not really produce much progress. It can churn for hours and get one test working, or churn for hours and fail.
I might give it another go with more guidance, because I'd really like an HTML/CSS renderer with some specific properties about how I interact with it from code.
What were the results? As far as I can tell, fewer agents work better, paired with someone who knows solid engineering.