More

lancekey · 2026-02-05T00:11:52 1770250312

Can you say a bit more about evals and your approach?

alexhans · 2026-02-06T02:38:30 1770345510

High level, the approach is:

- I'm pain point driven:

  - I can't compare what I can't measure. 

  - I can't trust to run this "AI" tool to run on its own

- That's automation, which is about intentionality (can I describe what I want?) and risk profile understanding (What's the blast radius/worst that could happen)

Then I treat it as if it was an Integration Test/Test Driven Development exercise of sorts.

- I don't start designing an entire cloud infrastructure.

- I make sure the "agent" is living in the location where the users actually live so that it can be the equivalent of an extra paid set of hands.

- I ask questions or replicate user stories and use deterministic tests wherever I can. Don't just go for LLMaaJ. What's the simplest thing you can think of?

- The important thing is rapid iteration and control. Just like in a unit testing scenario it's not about just writing a 100 tests but the ones that qualitatively allow you to move as fast as possible.

- At this stage where the space is moving so fast and we're learning so much, don't assume or try to over-optimize places that don't hurt and instead think about minimalism, ease of change, parameterization and ease of comparison with other components that form "the black box" and with itself.

- Once you have the benchmarks that you want, you can decide things like pick the cheapest model/agent configuration that does the job within the acceptable timeframe.

Happy to go deeper on these. I have some practical/runnable samples/text I can share on the topic after the weekend. I'll drop a link here when it's ready

lancekey · 2026-02-10T12:34:19 1770726859

This is really insightful. Thank you.

Your first two points jive with my intuition that an agents primaries should be a code execution sandbox, mounted files and git.

If you have any practical examples to share I’m sure a ton of people would appreciate it.

alexhans · 2026-02-15T19:00:55 1771182055

I just shared this in HN https://news.ycombinator.com/item?id=47026263 to see if it's possible to scale the knowledge sharing and simple and good practices which keep people in control.

It may or may not address the practical examples you need but I'd been to hear your thoughts and maybe it's possible to come up with a more illustrative one.

I didn't go for bubblewrap or similar containers yet because I didn't want to lose a specific type of baseline newcomer yet (Economists who do some coding) but I will be adding to it with whatever most elegant approaches I can find that don't leak too much complexity for things like sandboxing, system testing, integration mocking (reverse proxying), Observing with Openteleletry or otherwise, presenting benchmarks, etc.

lancekey · 2026-01-29T20:41:03 1769719263

Human Emulator offers agent based (near instant) quotes for any computer task.

lancekey · 2026-01-29T14:24:28 1769696668

I'm starting to add inference providers to computeprices.com, but if you even just look at GPU/hr rentals, there are some reasonable options out there.

I personally have been enjoying shadeform to build the GPU setup I like.

lancekey · 2026-01-20T22:31:37 1768948297

Yea looking at 60 day moving average on computeprices.com H100 have actually gone UP in cost recently, at least to rent.

A lot of demand out there for sure.

lancekey · 2026-01-06T12:40:56 1767703256

I’ve been working on computeprices.com as a side project for the last year to do just that.

mentos · 2026-01-07T12:46:38 1767789998

Is there a graph view that charts all GPU prices on one graph?

If not I think the landing page should be just that with checkbox filters for all GPUs on the left that you can easily toggle all on/off to show/hide their line on the graph.

a1371 · 2026-01-06T14:32:15 1767709935

I was not expecting that the prices are going down. Makes sense as the hardware gets older but I always assumed the prices must be inflated given how much competition there is to make new datacenters

lancekey · 2026-01-06T16:45:27 1767717927

Yes i was surprised too. I think it's mostly newer models pushing older ones down. I think there's also a lot of competitive pressure in this market. And the GPU shortage is not really a thing anymore.

claar · 2026-01-06T15:35:34 1767713734

This is cool!

Would it be possible to add "Best Value" / "best average performance per dollar" type thing?

lancekey · 2026-01-06T16:33:02 1767717182

Good idea! I'll noodle on how to define that.

lancekey · 2025-12-18T00:38:12 1766018292

Curious to learn what a “product benchmark” looks like. Is it evals you use to test prompts/models? A third party tool?

Examples from the wild are a great learning tool, anything you’re able to share is appreciated.

thecupisblue · 2025-12-18T15:41:52 1766072512

It's an internal benchmark that I use to test prompts, models and prompt-tunes, nothing but a dashboard calling our internal endpoints and showing the data, basically going through the prod flow.

For my product, I run a video through a multimodal LLM with multiple steps, combine data and spit out the outputs + score for the video.

I have a dataset of videos that I manually marked for my usecase, so when a new model drops, I run it + the last few best benchmarked models through the process, and check multiple things:

- Diff between outputed score and the manual one - Processing time for each step - Input/Output tokens - Request time for each step - Price of request

And the classic stats of average score delta, average time, p50, p90 etc. + One fun thing which is finding the edge cases, since even if the average score delta is low (means its spot-on), there are usually some videos where the abs delta is higher, so these usually indicate niche edge cases the model might have.

Gemini 3 Flash nails it sometimes even better than the Pro version, with nearly the same times as 2.5 Pro does on that usecase. Actually, pushed it to prod yesterday and looking at the data, it seems it's 5 seconds faster than Pro on average, with my cost-per-user going down from 20 cents to 12 cents.

IMO it's pretty rudimentary, so let me know if there's anything else I can explain.

theshrike79 · 2025-12-18T07:59:35 1766044775

Everyone should have their own "pelican riding a bicycle" benchmark they test new models on.

And it shouldn't be shared publicly so that the models won't learn about it accidentally :)

bluecalm · 2025-12-19T09:40:18 1766137218

I am asking the models to generate an image where fictional characters play chess or Texas Holdem. None of them can make a realistic chess position or poker game. Always something is off like too many pawns or too may cards, or some cards being ace-up when they shouldn't be.

ggsp · 2025-12-18T08:13:13 1766045593

Any suggestions for a simple tool to set up your own local evals?

dimava · 2025-12-18T13:52:44 1766065964

Just ask LLM to write one on top of OpenRouter, AI SDK and Bun To take your .md input file and save outputs as md files (or whatever you need) Take https://github.com/T3-Content/auto-draftify as example

theshrike79 · 2025-12-18T11:00:22 1766055622

My "tool" is just prompts saved in a text file that I feed to new models by hand. I haven't built a bespoke framework on top of it.

...yet. Crap, do I need to now? =)

ggsp · 2025-12-18T12:34:57 1766061297

Yeah I’ve wondered about the same myself… My evals are also a pile of text snippets, as are some of my workflows. Thought I’d have a look to see what’s out there and found Promptfoo and Inspect AI. Haven’t tried either but will for my next round of evals

kedihacker · 2025-12-18T14:04:20 1766066660

Well you need to stop them from getting incorporated into its training data

lobsterthief · 2025-12-18T12:12:23 1766059943

_Brain backlog project #77 created_

lancekey · 2025-12-14T19:45:26 1765741526

Working on computeprices.com - a cloud GPU rental price tracker

lancekey · 2025-12-06T16:02:17 1765036937

OP never said he wanted an alternative, just expressed surprise an alternative/competitor didn’t exist.

auggierose · 2025-12-06T16:14:24 1765037664

I know. I just picked on it because on one hand it is great that Wolfram was able to execute on his vision and make it work as a viable product. It needs a lot of resources to make something as great as Mathematica. On the other hand, an essential tool like that, you want to own it for life and grow with it, and I don't think that is quite possible with a "product".

lancekey · 2025-12-05T14:21:55 1764944515

great article and I'm a fan of the chat style about section. kudos!

observer2022 · 2025-12-05T16:01:34 1764950494

Thank you for your kind words! I don’t come from a particularly special background, so I tried to create something interesting and eye-catching to stand out.

lancekey · 2025-10-02T16:52:18 1759423938

SEEKING WORK | NYC / Remote Services: Web Development, Conversion Optimization, Data/Software/Process Consulting, Fractional Tech Leadership

Hi, I’m Gene, a consultant, developer and fractional CTO with 15+ years of experience delivering technology projects that help my customers grow. Main areas of focus: b2b & b2c web apps, ecommerce experiences, and conversion focused marketing sites.

I've helped McDonald's implement a new ERP system, GE develop an investor dashboard and Amazon build custom stores for their biggest sellers. But I especially enjoy helping scrappy startups and small businesses delight their users and convert them into long term clients.

Website: https://lansky.tech LinkedIn: https://www.linkedin.com/in/gkobilansky/

Contact: gene [at] lansky.tech