almogbaku 11 hours ago

Hey HN! Creator here. I recently found myself writing evaluations for actual production LLM projects, and kept facing the same dilemma: either reinvent the wheel or use a heavyweight commercial system with tons of features I don't need right now.

Then it hit me - evaluations are just (kind of) tests, so why not write them as such using pytest?

That's why I created pytest-evals - a lightweight pytest plugin for building evaluations. It's intentionally not a sophisticated system with dashboards (and it's not meant to be a "robust" solution). It's minimalistic, focused, and definitely not trying to be a startup.

  # Run the classifier and collect a prediction for each case
  @pytest.mark.eval(name="my_classifier")
  @pytest.mark.parametrize("case", TEST_DATA)
  def test_classifier(case: dict, eval_bag, classifier):
      # Run predictions and store results
      eval_bag.prediction = classifier(case["Input Text"])
      eval_bag.expected = case["Expected Classification"]
      eval_bag.accuracy = eval_bag.prediction == eval_bag.expected

  # Now let's see how our app performs across all cases...
  @pytest.mark.eval_analysis(name="my_classifier")
  def test_analysis(eval_results):
      accuracy = sum(result.accuracy for result in eval_results) / len(eval_results)
      print(f"Accuracy: {accuracy:.2%}")
      assert accuracy >= 0.7  # Ensure our performance is not degrading 
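
TEST_DATA and the classifier fixture are whatever you bring yourself. Just as a rough sketch (the case keys match the example above; the classifier here is a toy stand-in for a real LLM call, not part of pytest-evals):

  import pytest

  # Hypothetical test data - any iterable of dicts works
  TEST_DATA = [
      {"Input Text": "I love this product!", "Expected Classification": "positive"},
      {"Input Text": "Worst purchase ever.", "Expected Classification": "negative"},
  ]

  @pytest.fixture
  def classifier():
      # Toy stand-in for a real model call: maps text to a label
      def classify(text: str) -> str:
          return "positive" if "love" in text.lower() else "negative"
      return classify
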
Would love to hear your thoughts, and if you find this useful, a GitHub star would be appreciated.
westurner 9 hours ago

The pytest-evals README mentions that it's built on pytest-harvest, which works with pytest-xdist and pytest-asyncio.

pytest-harvest: https://smarie.github.io/python-pytest-harvest/ :

> Store data created during your pytest tests execution, and retrieve it at the end of the session, e.g. for applicative benchmarking purposes
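
If I'm reading the docs right, the pattern is roughly: stash values on the results_bag fixture inside each test, then read everything back from the session_results_df fixture, e.g.:

  # Rough illustration of the pytest-harvest pattern described above
  def test_model_speed(results_bag):
      results_bag.latency_ms = 42  # stash any value you want to keep

  def test_summary(session_results_df):
      # pandas DataFrame of results collected so far, one row per test,
      # including the stored latency_ms values
      print(session_results_df)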

  • almogbaku 8 hours ago

    Yeah, pytest-harvest is a pretty cool plugin.

    Originally I had a (very large and unfriendly) conftest file, but it was hard to collaborate on with other team members and quite repetitive. So I wrapped it up as a plugin, added some more functionality, and that's it.

    This plugin wraps some boilerplate code in a way that's easy to use, especially for the eval use case. It's minimalistic by design. Nothing big or fancy.