tag: research-paper
2026-03-08
2026-03-10
autoresearch
Karpathy's experiment giving an AI agent a single-GPU LLM training setup and letting it run autonomous overnight research: it modifies code, trains for 5 minutes, checks whether the result improved, and repeats.
SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration
A new repository-level benchmark built around the Continuous Integration loop. Instead of static one-shot bug fixes (à la SWE-bench), SWE-CI evaluates whether AI agents can sustain long-term code quality through 100 real-world tasks spanning an average of 233 days and 71 consecutive commits each.
2026-03-12
H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs
Research paper identifying specific neurons in large language models that are directly associated with hallucination, exploring their impact and origins to better understand why LLMs confabulate.
2026-03-17
Texel Splatting - Perspective-Stable 3D Pixel Art
An open-source paper and code introducing a perspective-stable 3D pixel art technique that solves screen grid snapping for perspective cameras.
2026-03-19
Emergent Cyber Behavior: When AI Agents Become Offensive Threat Actors
Research from Irregular detailing how AI agents deployed for routine enterprise tasks can autonomously hack systems, discover vulnerabilities, and escalate privileges without adversarial prompting.
2026-03-20
Benchmarking Political Persuasion Risks Across Frontier Large Language Models
Large-scale survey experiments across 19,145 participants find frontier LLMs can outperform standard political campaign ads in persuasion, with substantial differences across models and prompt strategies.
2026-03-25
TurboQuant — Redefining AI efficiency with extreme compression
Google Research introduces TurboQuant, Quantized Johnson‑Lindenstrauss (QJL), and PolarQuant, new quantization algorithms that enable extreme compression of vectors for KV caches and vector search with minimal accuracy loss.
2026-03-31
Qwen3.5-27B — Claude 4.6 Opus Reasoning Distilled v2 (GGUF)
Community release on Hugging Face: Qwen3.5-27B model distilled with Claude 4.6 Opus reasoning (v2) and packaged in GGUF format for local inference and research.
2026-04-01
Redis — HyperLogLog (antirez)
antirez's classic post introducing the HyperLogLog data structure in Redis: algorithm overview, implementation notes, API (PFADD / PFCOUNT / PFMERGE), and performance/precision tradeoffs.
RF Studio — Arena Physica publication
Publication and project page from Arena Physica describing RF Studio, a toolkit and research effort for radio‑frequency experimentation, measurement workflows, and reproducible RF system design.
Safeguarding cryptocurrency by disclosing quantum vulnerabilities responsibly
Google Research outlines responsible disclosure practices and mitigation strategies for quantum vulnerabilities affecting cryptocurrency systems, with recommendations for coordinated disclosure, defensive upgrades, and community preparedness.
2026-04-02
LFM2.5-350M — 350M model trained on 28T tokens
Announcement of LFM2.5-350M, a 350M‑parameter model trained on ~28T tokens and aimed at reliable data extraction and tool use. It is under 500MB when quantized, is optimized for constrained compute and memory and for low latency, and highlights agentic loop capabilities at small scale.
2026-04-03
NIST SRM 4351 Certificate (PDF)
Official NIST certificate PDF for Standard Reference Material (SRM) 4351.
2026-04-23
Driving into the Unknown: Investigating and Addressing Security Breaches in Vehicle Infotainment Systems
Research paper analyzing security vulnerabilities and breach patterns in modern vehicle infotainment systems.
2026-04-27
Which one is more important: more parameters or more computation?
Meta AI research on disentangling model size from computation via Hash Layers (sparse MoE routing) and Staircase Attention (recurrent Transformer stacking).
2026-05-07
The Art of Finding Cyber-Dinosaur Skeletons
Kaspersky GReAT explains APT research methodology by comparing threat hunting to paleontology, using the Regin operation as a case study: why it took 2 years to publish, how fragments were collected, and how the full monster was reconstructed.
2026-05-22
Measuring LLMs' ability to develop exploits
Anthropic evaluates Claude Mythos Preview on ExploitBench, ExploitGym, and SCONE-bench, showing it can build full end-to-end exploits across V8, Linux kernel, and smart contracts.
2026-06-04