[{"_id":"6a2cd0828137fb18cecbcc06","id":"Glint-Research/Fable-5-traces","author":"Glint-Research","disabled":false,"gated":false,"lastModified":"2026-06-29T15:10:20.000Z","likes":475,"trendingScore":102,"private":false,"sha":"e05c417852fc59fd8da758e68b352732423ca0cb","description":"\n  \n\n\n\n  \n    \n      \n        Glint Research Dataset Card\n        Fable 5 Pi Agent Traces\n        A compact, high-signal corpus of Fable 5 coding-agent traces converted into Hugging Face Agent Traces / Pi-compatible sessions for Data Studio inspection, tool-use policy learning, and reasoning/action distillation.\n      \n      \n        Primary Config\n        pi_agent/train\n        Agent Trace preview enabled\n      \n    \n    \n      4,665 Pi trace sessions\n      60 source sessions\n      3,799 tool… See the full description on the dataset page: https://huggingface.co/datasets/Glint-Research/Fable-5-traces.","downloads":38161,"tags":["task_categories:text-generation","annotations_creators:machine-generated","language:en","license:agpl-3.0","size_categories:1K<n<10K","format:json","format:agent-traces","modality:tabular","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us","agent-traces","pi-agent","claude-code","fable-5","chain-of-thought","tool-use","coding-agents","synthetic-data","distillation","cot"],"createdAt":"2026-06-13T03:37:38.000Z","key":""},{"_id":"6a2a47c4f5ff6c6dee016974","id":"armand0e/claude-fable-5-claude-code","author":"armand0e","disabled":false,"gated":false,"lastModified":"2026-06-19T16:23:10.000Z","likes":246,"trendingScore":52,"private":false,"sha":"c19fb6831700da833b22d1c9cdac47fe8603685c","description":"\n\t\n\t\t\n\t\n\t\n\t\tclaude-fable-5 Agent Traces\n\t\n\nIt's worth noting that our team was working with Glint-Research to collect as much fable data as possible.\nThese are just the anonymized raw traces of both of our teams combined. 
This means that Glint-Research/Fable-5-traces was created from formatting and splitting up this same dataset. If you use one for your tune, don't use the other (it's the same exact data).\n\nFor training on this dataset I recommend using the teich package to convert to openai… See the full description on the dataset page: https://huggingface.co/datasets/armand0e/claude-fable-5-claude-code.","downloads":12128,"tags":["task_categories:text-generation","size_categories:n<1K","format:json","format:agent-traces","modality:tabular","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us","agent-traces","format:agent-traces","claude","distillation","claude-fable-5","teich"],"createdAt":"2026-06-11T05:29:40.000Z","key":""},{"_id":"6a394ba974e3ccb07645f8a7","id":"Qwen/AgentWorldBench","author":"Qwen","disabled":false,"gated":false,"lastModified":"2026-06-24T02:05:57.000Z","likes":55,"trendingScore":50,"private":false,"sha":"db74b0cca6f7dd41b9684b18a3633d13f2bbf783","description":"\n\t\n\t\t\n\t\n\t\n\t\tAgentWorldBench\n\t\n\nAgentWorldBench is a comprehensive evaluation benchmark for language world models, constructed from real-world observations of frontier model trajectories on established benchmarks such as Tool Decathlon, Terminal-Bench 1.0 & 2.0, and OSWorld-Verified. 
Every evaluation sample is paired with a ground-truth observation obtained from real environment execution, enabling reference-grounded scoring.\nAgentWorldBench evaluates world modeling quality by scoring each… See the full description on the dataset page: https://huggingface.co/datasets/Qwen/AgentWorldBench.","downloads":1169,"tags":["task_categories:text-generation","language:en","license:apache-2.0","size_categories:1K<n<10K","format:json","modality:tabular","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2606.24597","region:us","world-model","agent","benchmark","evaluation","environment-simulation","qwen"],"createdAt":"2026-06-22T14:50:17.000Z","key":""},{"_id":"670befa7623c91990f914eb6","id":"mlabonne/open-perfectblend","author":"mlabonne","disabled":false,"gated":false,"lastModified":"2025-01-15T20:01:32.000Z","likes":114,"trendingScore":41,"private":false,"sha":"af60f3c18201652a83a93f46fcfee1b646ba3df7","description":"\n\n\t\n\t\t\n\t\t🎨 Open-PerfectBlend\n\t\n\nOpen-PerfectBlend is an open-source reproduction of the instruction dataset introduced in the paper \"The Perfect Blend: Redefining RLHF with Mixture of Judges\".\nIt's a solid general-purpose instruction dataset with chat, math, code, and instruction-following data.\n\n\t\n\t\t\n\t\tData source\n\t\n\n\nHere is the list of the datasets used in this mix:\n\n\t\n\t\t\nDataset\n# Samples\n\n\n\t\t\nmeta-math/MetaMathQA\n395,000\n\n\nopenbmb/UltraInteract_sft\n288,579\n\n\nHuggingFaceH4/ultrachat_200k… See the full description on the dataset page: 
https://huggingface.co/datasets/mlabonne/open-perfectblend.","downloads":1769,"tags":["license:apache-2.0","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2409.20370","region:us"],"createdAt":"2024-10-13T16:04:55.000Z","key":""},{"_id":"69fa9e0468659d62c5c9df7b","id":"LocalLaws/LOCUS-v1","author":"LocalLaws","disabled":false,"gated":false,"lastModified":"2026-06-20T03:06:10.000Z","likes":77,"trendingScore":33,"private":false,"sha":"4cee954ca8ad8e31cb0502dff6682c87b74b4302","description":"\n\t\n\t\t\n\t\n\t\n\t\tLOCUS v1.0\n\t\n\nThis repository contains the dataset presented in the paper Freeing the Law with LOCUS: A Local Ordinance Corpus for the United States.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Summary\n\t\n\nLOCUS v1.0 is a chunk-level dataset of U.S. municipal and county law text labeled by legal function. Each eligible chunk is assigned a function, a binary is_substantive label, and all substantive provisions are assigned a topic.\nThe dataset is intended for legal text research, local-law structure… See the full description on the dataset page: https://huggingface.co/datasets/LocalLaws/LOCUS-v1.","downloads":1665,"tags":["task_categories:text-classification","language:en","license:cc-by-nc-4.0","size_categories:1M<n<10M","format:parquet","format:optimized-parquet","modality:tabular","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2606.19334","region:us","law","legal-nlp","local-government","municipal-law","ordinances","classification"],"createdAt":"2026-05-06T01:48:52.000Z","key":""},{"_id":"6a3404497e03daf35bd3202e","id":"scholarweave/arxiv-latex","author":"scholarweave","disabled":false,"gated":false,"lastModified":"2026-06-30T07:57:39.000Z","likes":32,"trendingScore":31,"private":false,"sha":"b36d5e666af80149b4d4a49c7141e0f9df6e580e","description":"\n\t\n\t\t\n\t\n\t\n\t\tarXiv LaTeX Source Dataset\n\t\n\nThis dataset 
provides the entire corpus of arXiv's LaTeX source files, pre-parsed, formatted, and aligned with official metadata in ready-to-query Parquet files.\n\n\n\t\n\t\t\n\t\n\t\n\t\tWhy I Built This\n\t\n\nIf you have ever tried to work with the complete history of arXiv papers at scale, you have likely run into two massive hurdles:\n\nNetwork Egress Costs: While arXiv does offer public bulk access to its source files via S3 (s3://arxiv), the bucket is configured… See the full description on the dataset page: https://huggingface.co/datasets/scholarweave/arxiv-latex.","downloads":7967,"tags":["task_categories:text-generation","task_categories:feature-extraction","language:en","license:other","size_categories:1M<n<10M","modality:text","region:us","science","arxiv","latex","academic"],"createdAt":"2026-06-18T14:44:25.000Z","key":""},{"_id":"6a27d0ad7a6ecef661d66995","id":"BitRobot/HIW-500","author":"BitRobot","disabled":false,"gated":false,"lastModified":"2026-06-29T13:47:34.000Z","likes":33,"trendingScore":30,"private":false,"sha":"2ca7ffcd85ec5212f81ae08491a4076bf48ea841","description":"\n\t\n\t\t\n\t\n\t\n\t\tHIW-500: Humanoids In-the-Wild Dataset\n\t\n\nhttps://bitrobot-foundation.github.io/humanoids-in-the-wild-500-hours/\nHIW-500: Humanoids In-the-Wild Dataset is a large-scale dataset for whole-body humanoid robot learning in natural home environments. 
It captures human teleoperation demonstrations on Unitree G1 across real homes in Southeast Asia, where layouts, object states, lighting, clutter, and operator styles vary from episode to episode.\nThe dataset is designed for research on… See the full description on the dataset page: https://huggingface.co/datasets/BitRobot/HIW-500.","downloads":51732,"tags":["language:en","license:cc-by-4.0","region:us","robotics","humanoid"],"createdAt":"2026-06-09T08:37:01.000Z","key":""},{"_id":"6a34e9d01b6b6e116d313e13","id":"Crownelius/Complete-FABLE.5-traces-2M","author":"Crownelius","disabled":false,"gated":false,"lastModified":"2026-06-21T12:26:51.000Z","likes":39,"trendingScore":28,"private":false,"sha":"19a5b7863e10eec6838cf531bd20d24d2ec1106e","description":"\n  \n\n\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tComplete FABLE.5 Traces 2M\n\t\n\nFull FABLE.5 / Mythos corpus restored, with session-limit answer rows removed.\nDataset Viewer | Parquet | Raw JSONL.gz\n\n\nThis dataset is a post-closure compilation of all available FABLE.5 / Mythos trace datasets found on Hugging Face during the curation pass after the closure of Fable and Mythos. 
It is deduplicated at the normalized-row level and keeps row-level provenance through first_source_dataset, first_source_config, first_source_split… See the full description on the dataset page: https://huggingface.co/datasets/Crownelius/Complete-FABLE.5-traces-2M.","downloads":2471,"tags":["task_categories:text-generation","task_ids:language-modeling","annotations_creators:machine-generated","language_creators:found","language_creators:machine-generated","multilinguality:monolingual","language:en","license:mit","size_categories:1M<n<10M","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","agent-traces","traces","claude-code","fable-5","mythos","chain-of-thought","tool-use","coding-agents","synthetic-data","deduplicated","llm-traces","data-curation","parquet","json"],"createdAt":"2026-06-19T07:03:44.000Z","key":""},{"_id":"6a06cc4dea20d325f8fc6213","id":"ArtificialAnalysis/ITBench-AA","author":"ArtificialAnalysis","disabled":false,"gated":false,"lastModified":"2026-05-27T01:28:25.000Z","likes":44,"trendingScore":25,"private":false,"sha":"76df38a82288f75ba9e41dc8c515033332497473","description":"\n\t\n\t\t\n\t\n\t\n\t\tITBench-AA\n\t\n\nArtificial Analysis' release of the public scenarios from\nIBM's ITBench benchmark, used for\nthe ITBench-AA leaderboard.\nThis repo currently contains the SRE subset (sre config). Each row is a\nKubernetes incident scenario with its expected contributing-factor entities. An\nagent under evaluation is given access to an offline snapshot of the affected\ncluster (alerts, events, traces, topology) and must identify the entity\n(Deployment, Pod, ConfigMap, etc.) 
responsible for… See the full description on the dataset page: https://huggingface.co/datasets/ArtificialAnalysis/ITBench-AA.","downloads":41009,"tags":["task_categories:question-answering","language:en","license:cc-by-4.0","size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","sre","kubernetes","root-cause-analysis","agents","it-operations"],"createdAt":"2026-05-15T07:33:33.000Z","key":""},{"_id":"6a3543c41278d868e0c1bf12","id":"bcbl190626/SpanishBCBL","author":"bcbl190626","disabled":false,"gated":false,"lastModified":"2026-06-29T11:56:46.000Z","likes":25,"trendingScore":25,"private":false,"sha":"88f9096c6ce3a3fb17cc7b8e3131ff7f96da5684","description":"\n\t\n\t\t\n\t\n\t\n\t\tDECOMEG — Brain Activity During Typing (MEG & EEG)\n\t\n\nNon-invasive brain recordings (magnetoencephalography, MEG; and electroencephalography, EEG)\nof healthy adults typing briefly-memorized sentences on a QWERTY keyboard. This is the dataset\nunderlying Brain2Qwerty (Lévy et al., 2025) and its companion neuroscience study\n(Zhang et al., 2025).\n\n\t\n\t\t\n\t\n\t\n\t\tSummary\n\t\n\n\nParticipants: 35 healthy adult volunteers recruited at the Basque Center on Cognition,\nBrain and Language (BCBL), San… See the full description on the dataset page: https://huggingface.co/datasets/bcbl190626/SpanishBCBL.","downloads":1349,"tags":["task_categories:other","language:es","license:cc-by-nc-4.0","arxiv:2502.07429","region:us","neuroscience","meg","eeg","brain-computer-interface","bci","brain-to-text","typing","motor","electrophysiology"],"createdAt":"2026-06-19T13:27:32.000Z","key":""},{"_id":"6a05fb804b04c5157df46866","id":"WithinUsAI/claude_mythos_distilled_25k","author":"WithinUsAI","disabled":false,"gated":false,"lastModified":"2026-05-18T00:45:03.000Z","likes":132,"trendingScore":20,"private":false,"sha":"2c5e638c51a22b8b883def51bab685ae7e282c72","description":"\n\t\n\t\t\n\t\n\t\n\t\tClaude Mythos 
Distilled 25K\n\t\n\nA high-quality synthetic supervised fine-tuning (SFT) dataset designed to train and fine-tune any LLM to mirror the capabilities, reasoning style, agentic behavior, and technical depth of Anthropic's Claude Mythos (distilled frontier model).\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Summary\n\t\n\n\nSize: 25,000 high-quality examples\nFormat: JSONL with chat messages (user/assistant pairs) + rich metadata\nCategories (balanced for general + specialized capability):\nCybersecurity… See the full description on the dataset page: https://huggingface.co/datasets/WithinUsAI/claude_mythos_distilled_25k.","downloads":3475,"tags":["language:en","license:apache-2.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","synthetic","claude","mythos","distillation","cybersecurity","coding","reasoning","agentic","frontier-model-mirror","sft","instruction-tuning"],"createdAt":"2026-05-14T16:42:40.000Z","key":""},{"_id":"6a307dae8e258cbed418ec58","id":"XDOF/ABC-130k","author":"XDOF","disabled":false,"gated":"auto","lastModified":"2026-06-26T22:13:45.000Z","likes":66,"trendingScore":20,"private":false,"sha":"fad18a5f891a47e665756d4cab2a67a7a080d8bb","description":"\n\t\n\t\t\n\t\n\t\n\t\tABC-130k\n\t\n\nABC-130k is the largest open-source robot teleoperation dataset. It contains\nbimanual manipulation trajectories collected on two-arm YAM stations. Episodes\nare distributed as MCAP files, with subtask annotations kept as separate\nartifacts so they can be revised or extended independently of the underlying\nepisode data. 
For details on the accompanying paper, see abc.bot.\nPlease see the GitHub repo here for code to\ntrain and deploy with this dataset.\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset… See the full description on the dataset page: https://huggingface.co/datasets/XDOF/ABC-130k.","downloads":412268,"tags":["task_categories:robotics","language:en","license:apache-2.0","size_categories:n>1T","region:us","robotics","manipulation","imitation-learning","bimanual","teleoperation","mcap"],"createdAt":"2026-06-15T22:33:18.000Z","key":""},{"_id":"6a3a711102e6e7f9c77f3a2a","id":"Rapidata/svg-benchmark","author":"Rapidata","disabled":false,"gated":false,"lastModified":"2026-06-29T11:22:36.000Z","likes":20,"trendingScore":20,"private":false,"sha":"099319e447586f3e49ed3140c1d2ffe5cdf4647e","description":"\n\t\n\t\t\n\t\n\t\n\t\tRapidata Static SVG Generation Benchmark\n\t\n\nBuilt by Rapidata.\nThis dataset contains 1,355,161 human responses, collected with the\nRapidata Python SDK, comparing how well 30 frontier LLMs generate\nstatic SVGs from text prompts. 
Each row is a head-to-head comparison between two models' renders of\nthe same prompt, scored by human annotators on one of three questions (Preference, Coherence, Alignment).\nThe SVGs are produced as raw <svg> markup by the models, rasterized to 768×768 PNGs… See the full description on the dataset page: https://huggingface.co/datasets/Rapidata/svg-benchmark.","downloads":513,"tags":["task_categories:text-to-image","task_categories:image-classification","task_categories:reinforcement-learning","language:en","license:cc-by-4.0","size_categories:100K<n<1M","format:parquet","format:optimized-parquet","modality:image","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us","SVG","vector-graphics","code-generation","text-to-svg","Human","Preference","Alignment","Coherence","rapidata","benchmark","llm-evaluation"],"createdAt":"2026-06-23T11:42:09.000Z","key":""},{"_id":"6a2c5668f7f66fcaa0d54e17","id":"CodeDevX/Vibe-Coding-Instruct","author":"CodeDevX","disabled":false,"gated":false,"lastModified":"2026-06-18T13:52:24.000Z","likes":173,"trendingScore":19,"private":false,"sha":"7ad49b3cbf0b73934b1d567d2b5c4768bce7989e","downloads":2299,"tags":["task_categories:text-generation","language:en","license:apache-2.0","size_categories:1M<n<10M","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","custom","vibecodinginstruct"],"createdAt":"2026-06-12T18:56:40.000Z","key":""},{"_id":"67d45c3d35fc7f6d2ab224c8","id":"allenai/olmOCR-bench","author":"allenai","disabled":false,"gated":false,"lastModified":"2026-02-19T17:28:38.000Z","likes":251,"trendingScore":18,"private":false,"sha":"54a96a6fb6a2bd3b297e59869491db4d3625b711","description":"\n\t\n\t\t\n\t\n\t\n\t\tolmOCR-bench\n\t\n\nolmOCR-bench is a dataset of 1,403 PDF files, plus 7,010 unit test cases that capture properties of the output that a good OCR system should have. 
\nThis benchmark evaluates the ability of OCR systems to accurately convert PDF documents to markdown format while preserving critical textual and structural information.\nQuick links:\n\n📃 Paper\n🛠️ Code\n🎮 Demo\n\n\n\t\n\t\t\n\t\n\t\n\t\tTable 1. Distribution of Test Classes by Document Source\n\t\n\n\n\t\n\t\t\nDocument Source\nText Present\nText… See the full description on the dataset page: https://huggingface.co/datasets/allenai/olmOCR-bench.","downloads":7476,"tags":["benchmark:official","benchmark:eval-yaml","language:en","license:odc-by","size_categories:1K<n<10K","modality:document","modality:text","arxiv:2502.18443","region:us","text"],"createdAt":"2025-03-14T16:41:33.000Z","key":""},{"_id":"6a2d8bf9763f90e1368360cb","id":"lordx64/agentic-distill-fable-5-sft","author":"lordx64","disabled":false,"gated":false,"lastModified":"2026-06-15T14:15:12.000Z","likes":49,"trendingScore":17,"private":false,"sha":"9df06dd13b692dd482bd6ef0e547f577a5f94942","description":"\n\t\n\t\t\n\t\n\t\n\t\tFable-5 SFT — prepared for Qwable fine-tuning\n\t\n\n4,659 single-turn pairs from Claude Fable-5 (Anthropic preview model, suspended globally 2026-06-22 under U.S. 
export-control directives), reformatted into a single-text-column parquet ready for SFTTrainer(dataset_text_field=\"text\") + train_on_responses_only.\nComposition:\n\n3,793 rows (81%) end in a <tool_use> block — agentic tool-call patterns\n866 rows (19%) end in a pure text response\n\nThis is agentic data, not pure reasoning data.… See the full description on the dataset page: https://huggingface.co/datasets/lordx64/agentic-distill-fable-5-sft.","downloads":1254,"tags":["task_categories:text-generation","language:en","license:agpl-3.0","size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","agentic","chain-of-thought","distillation","claude","claude-fable-5","agent-traces","sft","qwen-chat-template","qwable"],"createdAt":"2026-06-13T16:57:29.000Z","key":""},{"_id":"69f434edee1d16ec78d229ce","id":"angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k","author":"angrygiraffe","disabled":false,"gated":false,"lastModified":"2026-05-01T17:11:41.000Z","likes":424,"trendingScore":16,"private":false,"sha":"f0330e0ca46469b3928adef18c2b55f9476d6bd3","description":"\n\t\n\t\t\n\t\n\t\n\t\tBackground\n\t\n\nEnded up with some tokens to burn on a Claude Max plan. Assembly began during 4.6 and moved to 4.7. Model is tagged. The development evolved as it went along. The dataset has not been manually reviewed. It's entirely Claude developed.\n\n\t\n\t\t\n\t\n\t\n\t\tClarification on Reasoning\n\t\n\nThe reasoning is not Claude's actual chain-of-thought (cot) and is not summarized cot. 
It's a fully synthetic cot created as part of the Assistant response to mimic the type of \"thinking\"… See the full description on the dataset page: https://huggingface.co/datasets/angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k.","downloads":8975,"tags":["task_categories:text-generation","task_categories:question-answering","language:en","license:apache-2.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:dask","library:mlcroissant","region:us","sft","chain-of-thought","coding","math","roleplay","science","humanities","art","multi-turn","text","json"],"createdAt":"2026-05-01T05:06:53.000Z","key":""},{"_id":"6a27d2d419ba88e1bcf065fb","id":"BitRobot/HIW-500-LeRobot","author":"BitRobot","disabled":false,"gated":false,"lastModified":"2026-06-29T13:48:50.000Z","likes":16,"trendingScore":16,"private":false,"sha":"d935d493368f9cb29addf6917b2a877e586efefd","description":"This dataset was created using LeRobot.\n\n\n\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Structure\n\t\n\nmeta/info.json:\n{\n    \"codebase_version\": \"v3.0\",\n    \"fps\": 30,\n    \"features\": {\n        \"observation.images.head\": {\n            \"dtype\": \"video\",\n            \"shape\": [\n                480,\n                1280,\n                3\n            ],\n            \"names\": [\n                \"height\",\n                \"width\",\n                \"channels\"\n            ],\n            \"info\": {\n                \"video.height\":… See the full description on the dataset page: 
https://huggingface.co/datasets/BitRobot/HIW-500-LeRobot.","downloads":17764,"tags":["task_categories:robotics","language:en","license:cc-by-4.0","size_categories:10M<n<100M","format:parquet","modality:tabular","modality:text","modality:timeseries","modality:video","library:datasets","library:dask","library:polars","library:mlcroissant","library:lerobot","region:us","LeRobot","robotics","humanoid","unitree-g1","manipulation","mobile-manipulation","bimanual","imitation-learning","teleoperation","in-the-wild"],"createdAt":"2026-06-09T08:46:12.000Z","key":""},{"_id":"6a3246e7ae94378f6d10aff0","id":"PawanKrd/claude-fable-5-code","author":"PawanKrd","disabled":false,"gated":false,"lastModified":"2026-06-17T07:37:06.000Z","likes":25,"trendingScore":14,"private":false,"sha":"4bf63f6009a984b50f5a7e07368e3fe24fa849aa","description":"\n\t\n\t\t\n\t\n\t\n\t\tClaude Fable 5 Coding and Math Dataset (Non-Thinking)\n\t\n\nThis repository contains a dataset of 603 coding and math-related prompts and responses from Claude Fable 5.\nThe generation of this dataset cost approximately $75.\nPlease note that this dataset is non-thinking. 
Fable 5 only supported adaptive thinking, and it decided not to think for these prompts, meaning there is no chain-of-thought/reasoning content in this dataset.\n\n\t\n\t\t\n\t\n\t\n\t\tOrigin of Prompts\n\t\n\nThe prompts in this… See the full description on the dataset page: https://huggingface.co/datasets/PawanKrd/claude-fable-5-code.","downloads":647,"tags":["task_categories:text-generation","language:en","size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","code","claude","fable-5"],"createdAt":"2026-06-17T07:04:07.000Z","key":""},{"_id":"69af1e7e484ef491320be72e","id":"Aignostics/OpenTME","author":"Aignostics","disabled":false,"gated":"manual","lastModified":"2026-06-19T12:03:22.000Z","likes":30,"trendingScore":13,"private":false,"sha":"b2e4a22823f8d14d1097d568461afeeb1e5bc67e","description":"\n\t\n\t\t\n\t\n\t\n\t\tOpenTME: Open-Access Tumor Microenvironment Profiles from TCGA\n\t\n\nOpenTME is an open-access project by Aignostics for academic researchers. It provides comprehensive spatial outputs for whole slide images (WSIs) of formalin-fixed, paraffin-embedded slides from The Cancer Genome Atlas (TCGA). 
OpenTME is powered by Atlas H&E-TME – a computational pathology application developed by Aignostics.\n\n\t\n\t\t\n\t\n\t\n\t\tAtlas H&E-TME\n\t\n\nAtlas H&E-TME is a foundation model-based application for… See the full description on the dataset page: https://huggingface.co/datasets/Aignostics/OpenTME.","downloads":15595,"tags":["task_categories:image-classification","task_categories:image-segmentation","task_categories:image-feature-extraction","task_categories:object-detection","license:other","size_categories:10K<n<100K","format:imagefolder","modality:image","library:datasets","library:mlcroissant","arxiv:2604.12075","region:us","biomarker-discovery","biology","bladder-cancer","breast-cancer","cancer","cell-segmentation","colorectal-cancer","computational-pathology","digital-pathology","H&E","liver-cancer","lung-cancer","oncology","pancreatic-cancer","pathology","prostate-cancer","spatial-biology","stomach-cancer","TCGA","tissue-segmentation","tumor-microenvironment","whole-slide-imaging"],"createdAt":"2026-03-09T19:24:46.000Z","key":""},{"_id":"69e15643062441e6b7109caa","id":"nvidia/Open-SWE-Traces","author":"nvidia","disabled":false,"gated":false,"lastModified":"2026-06-25T16:48:33.000Z","likes":31,"trendingScore":12,"private":false,"sha":"f6689f56f1af2e2082861738071d4c4278b1922a","description":"\n\t\n\t\t\n\t\n\t\n\t\tOpen-SWE-Traces: Advancing Distillation for Software Engineering Agents\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tData Overview\n\t\n\nOpen-SWE-Traces is an agentic instruction tuning dataset designed to advance the capabilities of LLMs in software engineering. This dataset comprises 200k+ agent \ntrajectories collected using the SWE-agent and OpenHands framework. 
The trajectories \nwere synthesized using Minimax-M2.5 and Qwen3.5-122B-A10B and \nspecifically curated for supervised fine-tuning (SFT), aiming to… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Open-SWE-Traces.","downloads":2647,"tags":["license:cc-by-4.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2606.16038","region:us","code","synthetic","tools","agents","software"],"createdAt":"2026-04-16T21:36:03.000Z","key":""},{"_id":"6a35a1b97d1c93c320e0c0d1","id":"AletheiaResearch/GLM-5.2-Agent","author":"AletheiaResearch","disabled":false,"gated":false,"lastModified":"2026-06-23T18:46:49.000Z","likes":21,"trendingScore":12,"private":false,"sha":"10f3e37942a1f5abb6b3f04c71886f9a36248fa3","description":"This dataset was generated using teich by TeichAI \n\n\t\n\t\t\n\t\n\t\n\t\tGLM-5.2 Agent traces\n\t\n\nThis directory contains raw agent trace files generated by teich.\nJSONL files: 284\nModel metadata: z-ai/glm-5.2\n\n\t\n\t\t\n\t\n\t\n\t\tTraining-ready tools\n\t\n\nGenerated agent traces carry configured or recovered tool schemas so tools remain available for training even when a session did not call them.\nNative Claude Code imports recover schemas for Claude Code and Claude Desktop built-ins, plus conservative name-derived… See the full description on the dataset page: 
https://huggingface.co/datasets/AletheiaResearch/GLM-5.2-Agent.","downloads":1029,"tags":["task_categories:text-generation","size_categories:n<1K","format:json","format:agent-traces","modality:tabular","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:eu","agent-traces","format:agent-traces","pi","distillation","z-ai/glm-5.2","teich"],"createdAt":"2026-06-19T20:08:25.000Z","key":""},{"_id":"66212f29fb07c3e05ad0432e","id":"HuggingFaceFW/fineweb","author":"HuggingFaceFW","disabled":false,"gated":false,"lastModified":"2025-07-11T20:16:53.000Z","likes":2908,"trendingScore":11,"private":false,"sha":"9bb295ddab0e05d785b879661af7260fed5140fc","description":"\n\t\n\t\t\n\t\n\t\n\t\t🍷 FineWeb\n\t\n\n\n    \n\n\n\n15 trillion tokens of the finest data the 🌐 web has to offer\n\n\n\t\n\t\t\n\t\n\t\n\t\tWhat is it?\n\t\n\nThe 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated english web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large scale data processing library. 
\n🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.","downloads":253313,"tags":["task_categories:text-generation","language:en","license:odc-by","size_categories:10B<n<100B","modality:tabular","modality:text","arxiv:2306.01116","arxiv:2109.07445","arxiv:2406.17557","doi:10.57967/hf/2493","region:us"],"createdAt":"2024-04-18T14:33:13.000Z","key":""},{"_id":"69e9263752b11322a41bff09","id":"Meddies/meddies-persona-vie","author":"Meddies","disabled":false,"gated":false,"lastModified":"2026-06-27T21:40:41.000Z","likes":14,"trendingScore":11,"private":false,"sha":"318ad9e530bfc054b0adf1bb439a8badfe91feb3","description":"\n\t\n\t\t\n\t\n\t\n\t\tMeddies Persona VIE\n\t\n\n\n  \n  \n  \n  \n\n\nVietnamese synthetic patient personas for teams that need better patient context, not just better-looking generated text.\n\n[!IMPORTANT]\nThis is a research and simulation artifact for healthcare AI teams building Vietnamese synthetic data pipelines. 
It is not a record of real patients, not a prevalence reference, and not a tool for clinical decision-making.\nIf you want to use this dataset in commercial work, please contact us at… See the full description on the dataset page: https://huggingface.co/datasets/Meddies/meddies-persona-vie.","downloads":941,"tags":["task_categories:other","annotations_creators:machine-generated","language_creators:machine-generated","multilinguality:monolingual","source_datasets:HoangHa/meddies-persona","language:vi","license:cc-by-nc-4.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us","medical","healthcare","synthetic","persona","vietnamese","patient-simulation","clinical-simulation","text-generation"],"createdAt":"2026-04-22T19:49:11.000Z","key":""},{"_id":"6a261e159c9c7a503bfdea7c","id":"open-thoughts/OpenThoughts-Agent-SFT-100K","author":"open-thoughts","disabled":false,"gated":false,"lastModified":"2026-06-08T07:01:02.000Z","likes":12,"trendingScore":11,"private":false,"sha":"45fb28fcc38d352133cb28a1c8a43a2f14fea97b","description":"\n    \n\n\n\nProject |\nCode |\nCollection\n\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tOpenThoughts-Agent-SFT-100K\n\t\n\nOpenThoughts-Agent is an open-source effort to curate the best datasets for training agents. Our release includes datasets, models and our research codebase.\nOpenThoughts-Agent-SFT-100K is the 100,000-example point of the OpenThoughts-Agent SFT scaling ladder (sizes 316 / 1K / 3.16K / 10K / 31.6K / 100K). 
It contains (task, agent-trajectory) pairs used to fine-tune OpenThinkerAgent-8B-SFT-100K and… See the full description on the dataset page: https://huggingface.co/datasets/open-thoughts/OpenThoughts-Agent-SFT-100K.","downloads":440,"tags":["language:en","license:apache-2.0","size_categories:10K<n<100K","format:parquet","format:optimized-parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us","agents","terminal","code","software-engineering","sft"],"createdAt":"2026-06-08T01:42:45.000Z","key":""},{"_id":"67d05b35637847cd702f212e","id":"futo-org/swipe.futo.org","author":"futo-org","disabled":false,"gated":false,"lastModified":"2026-06-25T20:28:18.000Z","likes":26,"trendingScore":10,"private":false,"sha":"d71bf5fd7f45b3e7c2ed2d76a21b0dbd3b4ba566","description":"\n\t\n\t\t\n\t\n\t\n\t\tDataset Card for swipe.futo.org\n\t\n\nThis dataset is presented in the paper FUTO Swipe: Layout-Agnostic Neural Swipe Decoding.\nIt contains multiple collection runs from the swipe.futo.org website. The QWERTY layout definition is provided here\n\n\t\n\t\t\n\t\n\t\n\t\tCollection process\n\t\n\nUsers were able to volunteer to contribute to our dataset. 
After visiting the site on a mobile device, they were given words to swipe as part of a pre-defined sentence set.\nUsers were allowed to go back to retry… See the full description on the dataset page: https://huggingface.co/datasets/futo-org/swipe.futo.org.","downloads":545,"tags":["task_categories:other","language:en","license:mit","size_categories:1M<n<10M","format:json","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2606.25247","region:us"],"createdAt":"2025-03-11T15:48:05.000Z","key":""},{"_id":"6a421e32b1d98b18fb3edd74","id":"ginigen-ai/Metacognition-Bench","author":"ginigen-ai","disabled":false,"gated":false,"lastModified":"2026-06-30T00:02:39.000Z","likes":16,"trendingScore":10,"private":false,"sha":"68b3b71150e2e0e0dd658962fcf6e0b460107a37","description":"\n\t\n\t\t\n\t\n\t\n\t\tMetacognition-Bench\n\t\n\n\n\n\n\"Not whether a model knows the answer — but whether it knows when it might be wrong, and can correct itself.\"\n\nMetacognition-Bench is a curated benchmark of 300 metacognitive-trap problems that measure functional metacognition in Large Language Models: the ability to detect and recover from one's own reasoning errors, rather than final-answer accuracy alone.\nEvery problem embeds a hidden_trap — a seductive but wrong reasoning path that makes even capable… See the full description on the dataset page: 
https://huggingface.co/datasets/ginigen-ai/Metacognition-Bench.","downloads":0,"tags":["task_categories:text-generation","task_categories:question-answering","language:en","license:apache-2.0","size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","metacognition","self-correction","hallucination-detection","reasoning","benchmark","trap-escape","error-recovery","metacognition-adapter","aether"],"createdAt":"2026-06-29T07:26:42.000Z","key":""},{"_id":"6a2c0ff5f05071e5d8d863dd","id":"makora-ai/triton-gpu-latency","author":"makora-ai","disabled":false,"gated":false,"lastModified":"2026-06-12T15:14:03.000Z","likes":13,"trendingScore":9,"private":false,"sha":"3b30911350fa433e1cb97ec2a41bd8908dd35d5d","description":"\n\t\n\t\t\n\t\n\t\n\t\tTriton GPU Latency Dataset\n\t\n\nA large dataset of PyTorch problems (mostly from KernelBench) paired with candidate Triton-kernel implementations and their measured GPU runtimes generated by MakoraGenerate. Each row is a self-contained Python program that defines (1) a reference Model written with plain PyTorch ops and (2) a ModelNew that re-implements the same forward pass with a hand-written or generated Triton kernel. 
The label is the runtime of executing ModelNew.\nBuilt for… See the full description on the dataset page: https://huggingface.co/datasets/makora-ai/triton-gpu-latency.","downloads":161,"tags":["task_categories:text-generation","language:en","license:apache-2.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","gpu","triton","cuda","kernel-generation","code","pytorch","latency","benchmark"],"createdAt":"2026-06-12T13:56:05.000Z","key":""},{"_id":"69d3b00b2d56eb23d8824420","id":"badlogicgames/pi-mono","author":"badlogicgames","disabled":false,"gated":false,"lastModified":"2026-04-06T13:10:36.000Z","likes":172,"trendingScore":8,"private":false,"sha":"dac2a1d3ba12dda597b973a791a77618ccb5f413","description":"\n\t\n\t\t\n\t\n\t\n\t\tCoding agent session traces for badlogicgames/pi-mono\n\t\n\nThis dataset contains redacted coding agent session traces collected while working on https://github.com/badlogic/pi-mono.git. The traces were exported with pi-share-hf from a local pi workspace and filtered to keep only sessions that passed deterministic redaction and LLM review.\n\n\t\n\t\t\n\t\n\t\n\t\tData description\n\t\n\nEach *.jsonl file is a redacted pi session. 
Sessions are stored as JSON Lines files where each line is a structured… See the full description on the dataset page: https://huggingface.co/datasets/badlogicgames/pi-mono.","downloads":2401,"tags":["task_categories:text-generation","language:en","language:code","license:other","region:us","agent-traces","coding-agent","pi-share-hf"],"createdAt":"2026-04-06T13:07:23.000Z","key":""},{"_id":"6a00ee1b8af2b11e0d2b374b","id":"WithinUsAI/GPT_5.5_Distilled","author":"WithinUsAI","disabled":false,"gated":false,"lastModified":"2026-05-12T17:49:04.000Z","likes":26,"trendingScore":8,"private":false,"sha":"4f49e8c7e98ca80694b7378ff8fed5f7344c5fb3","downloads":951,"tags":["license:apache-2.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-05-10T20:44:11.000Z","key":""},{"_id":"625552d2b339bb03abe3432d","id":"openai/gsm8k","author":"openai","disabled":false,"gated":false,"lastModified":"2026-03-23T10:18:13.000Z","likes":1408,"trendingScore":7,"private":false,"sha":"740312add88f781978c0658806c59bc2815b9866","description":"\n\t\n\t\t\n\t\tDataset Card for GSM8K\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nGSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. 
The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.\n\nThese problems take between 2 and 8 steps to solve.\nSolutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.","downloads":908873,"paperswithcode_id":"gsm8k","tags":["benchmark:official","benchmark:eval-yaml","task_categories:text-generation","annotations_creators:crowdsourced","language_creators:crowdsourced","multilinguality:monolingual","source_datasets:original","language:en","license:mit","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2110.14168","region:us","math-word-problems"],"createdAt":"2022-04-12T10:22:10.000Z","key":""},{"_id":"67ac9b0ae2c56194379f17a9","id":"SakanaAI/AI-CUDA-Engineer-Archive","author":"SakanaAI","disabled":false,"gated":false,"lastModified":"2025-02-20T02:02:27.000Z","likes":221,"trendingScore":7,"private":false,"sha":"4edbe8d6d0b417e05aaf8ec7e23f78aecdc5516b","description":"\n\t\n\t\t\n\t\n\t\n\t\tThe AI CUDA Engineer Archive 👷: Agentic CUDA Kernel Discovery, Optimization & Composition\n\t\n\n\nWe release The AI CUDA Engineer archive, a dataset consisting of approximately 30,000 CUDA kernels generated by The AI CUDA Engineer. It is released under the CC-By-4.0 license and can be accessed via HuggingFace and interactively visualized here. 
The dataset is based on the Kernel tasks provided in KernelBench and includes a torch reference implementation, torch, NCU and Clang-tidy… See the full description on the dataset page: https://huggingface.co/datasets/SakanaAI/AI-CUDA-Engineer-Archive.","downloads":3579,"tags":["license:cc-by-4.0","size_categories:10K<n<100K","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","code"],"createdAt":"2025-02-12T12:58:50.000Z","key":""},{"_id":"6835e8703de5738a2e9af4ae","id":"nvidia/PhysicalAI-Autonomous-Vehicles","author":"nvidia","disabled":false,"gated":"auto","lastModified":"2026-05-06T21:55:22.000Z","likes":927,"trendingScore":7,"private":false,"sha":"b719eea7f0a63619ef51ec7f54178af0937ef050","description":"\n\t\n\t\t\n\t\tPHYSICAL AI AUTONOMOUS VEHICLES\n\t\n\n\nThe PhysicalAI-Autonomous-Vehicles dataset provides one of the largest, geographically diverse collections of multi-sensor data empowering AV researchers to build the next generation of Physical AI based end-to-end driving systems. 
This dataset is ready for commercial/non-commercial AV use per the license agreement.\n\nData Collection Method\n\nAutomatic/Sensor \n\n\nLabeling Method\n\nAutomatic/Sensor \n\n\n\nThis dataset has a total of 1700 hours of driving… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles.","downloads":205068,"tags":["license:other","region:us"],"createdAt":"2025-05-27T16:29:36.000Z","key":""},{"_id":"685d6910fa047a479aefe2a2","id":"ibm-research/AssetOpsBench","author":"ibm-research","disabled":false,"gated":false,"lastModified":"2026-05-28T17:10:58.000Z","likes":43,"trendingScore":7,"private":false,"sha":"5e25bb7f2cd37fb68b9a9e1f99d170ca5be7ce17","description":"\n\t\n\t\t\n\t\n\t\n\t\tAssetOpsBench\n\t\n\nAssetOpsBench is a specialized benchmark designed for evaluating Large Language Models (LLMs) and Multi-Agent systems in industrial operations. It focuses on the intersection of sensor data interpretation, maintenance logic, and Prognostics and Health Management (PHM).\nThe benchmark enables researchers to test how effectively AI agents can manage complex industrial assets, such as compressors and hydraulic pumps, by applying rule-based logic and diagnostic… See the full description on the dataset page: 
https://huggingface.co/datasets/ibm-research/AssetOpsBench.","downloads":891,"tags":["task_categories:question-answering","task_categories:time-series-forecasting","language:en","license:apache-2.0","size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2506.03828","region:us","Industry","PHM","Predictive-Maintenance","Asset-Management","tool-learning","task-automation","LLM","Multi-Agent"],"createdAt":"2025-06-26T15:36:48.000Z","key":""},{"_id":"69e695a5d20baec02ee3039c","id":"nvidia/Nemotron-Personas-Korea","author":"nvidia","disabled":false,"gated":false,"lastModified":"2026-06-22T15:37:31.000Z","likes":511,"trendingScore":7,"private":false,"sha":"ada0f5b53a38bb5a30cce09358adde883c1ab63a","description":"\n\t\n\t\t\n\t\n\t\n\t\tNemotron-Personas-Korea\n\t\n\n\n  \n  \n    우리나라 실제 분포에 기반한 합성 페르소나를 위한 복합 AI 시스템\n    A compound AI approach to personas grounded in real-world distributions\n  \n\n\n\n\t\n\t\t\n\t\n\t\n\t\t데이터셋 개요 (Overview)\n\t\n\nNemotron-Personas-Korea는 대한민국의 실제 인구통계학적·지리적·성격 특성 분포를 기반으로 합성된 오픈소스 페르소나 데이터셋(CC BY 4.0)으로, 우리나라 인구의 다양성과 특성을 폭넓게 반영하도록 설계되었습니다. 
이는 최초의 대규모 우리말 페르소나 데이터셋이며, 이름, 성별, 나이, 혼인 상태, 교육 수준, 직업, 거주 지역 등의 속성을 실제 대한민국 국가데이터처 국가통계포털(KOSIS), 대법원, 국민건강보험공단, 농촌경제연구원, NAVER Cloud 통계 자료를 기반으로 합성하였습니다.… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-Personas-Korea.","downloads":12862,"tags":["task_categories:text-generation","language:ko","license:cc-by-4.0","size_categories:1M<n<10M","format:parquet","format:optimized-parquet","modality:image","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","library:datadesigner","region:us","synthetic","personas","NVIDIA","Korean","datadesigner"],"createdAt":"2026-04-20T21:07:49.000Z","key":""},{"_id":"6a1ceae73647a19350d8531b","id":"xlangai/osworld_v2_tasks","author":"xlangai","disabled":false,"gated":"auto","lastModified":"2026-06-24T15:30:35.000Z","likes":9,"trendingScore":7,"private":false,"sha":"796f1aa190229c54cff1f18b46ddf64d928863b0","description":"\n\t\n\t\t\n\t\n\t\n\t\tOSWorld V2 Task Classes\n\t\n\nThis gated dataset contains the official root-level task_*.py Python task classes for OSWorld V2.\nThe public GitHub repository keeps the task loader, helper utilities, and documentation. 
The task implementations are gated to reduce benchmark leakage and to help prevent evaluated agents from finding task answers, setup logic, or evaluator details online while executing a task.\nDownload from the public repository root with:\nuvx --from huggingface_hub hf… See the full description on the dataset page: https://huggingface.co/datasets/xlangai/osworld_v2_tasks.","downloads":708,"tags":["license:apache-2.0","size_categories:n<1K","format:json","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-06-01T02:13:59.000Z","key":""},{"_id":"6a3571c863a799204f75b7bc","id":"STBack23/omnivoice-vi","author":"STBack23","disabled":false,"gated":false,"lastModified":"2026-06-24T19:13:29.000Z","likes":10,"trendingScore":7,"private":false,"sha":"60c0d530d3476e59d31b1dbe8393e12e4f74f037","description":"\n\t\n\t\t\n\t\n\t\n\t\tOmniVoice VI — Giọng Việt + SRT lồng tiếng\n\t\n\nDataset chứa 6 giọng tiếng Việt và công cụ speak.py để chạy trên Google Colab với OmniVoice.\n\n\t\n\t\t\n\t\n\t\n\t\tGiọng có sẵn\n\t\n\n\n\t\n\t\t\nSlug\nTên\n\n\n\t\t\nban_mai\nBan Mai\n\n\nlan_trinh\nLan Trinh\n\n\nngan_ha\nNgan Ha\n\n\nngoc_huyen\nNgoc Huyen\n\n\nthao_trinh\nThao Trinh\n\n\ntuong_vy\nTuong Vy\n\n\n\t\n\nMỗi giọng gồm profile.json, voice.pt (prompt cache), audio mẫu và ref_text.txt.\n\n\t\n\t\t\n\t\n\t\n\t\tChạy trên Colab\n\t\n\n\nMở notebook colab/Omivoice_VI_Colab.ipynb\nĐặt HF_REPO =… See the full description on the dataset page: 
https://huggingface.co/datasets/STBack23/omnivoice-vi.","downloads":634,"tags":["language:vi","license:apache-2.0","size_categories:n<1K","format:text","modality:audio","modality:text","library:datasets","library:mlcroissant","region:us","text-to-speech","tts","voice-cloning","vietnamese","srt","dubbing","omnivoice"],"createdAt":"2026-06-19T16:43:52.000Z","key":""},{"_id":"6a36e65ae86005a76f1c7adf","id":"ajibawa-2023/Shell-Code-Large","author":"ajibawa-2023","disabled":false,"gated":false,"lastModified":"2026-06-20T19:48:46.000Z","likes":19,"trendingScore":7,"private":false,"sha":"91adad625cc7d91ce983f95466e1bbfb2693fd87","description":"\n\t\n\t\t\n\t\n\t\n\t\tShell-Code-Large\n\t\n\nShell-Code-Large is a large-scale corpus of Shell scripting source code comprising approximately 640,000 code samples stored in JSON Lines (.jsonl) format. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, DevOps automation, cloud infrastructure engineering, system administration, and software engineering automation.\nBy providing a high-volume, language-specific corpus focused exclusively on Shell scripting… See the full description on the dataset page: https://huggingface.co/datasets/ajibawa-2023/Shell-Code-Large.","downloads":299,"tags":["task_categories:text-generation","language:en","license:mit","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us","Shell","Code","LLM","Training"],"createdAt":"2026-06-20T19:13:30.000Z","key":""},{"_id":"6a37fba2e45f47278427402a","id":"macrodata/WGO-Bench","author":"macrodata","disabled":false,"gated":false,"lastModified":"2026-06-28T16:50:51.000Z","likes":7,"trendingScore":7,"private":false,"sha":"795dc4429128cfcd288e2245f3ced674fb83322b","description":"\n\t\n\t\t\n\t\n\t\n\t\tWGO-Bench: What's Going On Benchmark\n\t\n\nWGO-Bench is a small, manually annotated benchmark for evaluating how well 
vision-language models can turn robot and egocentric manipulation videos into timestamped subtask annotations.\nEach row contains one video episode, a high-level task instruction, and gold subtask segments with start time, end time, and a concise action label. The benchmark is designed for two related tasks:\n\nBoundary detection: predict where one meaningful manipulation… See the full description on the dataset page: https://huggingface.co/datasets/macrodata/WGO-Bench.","downloads":420,"tags":["task_categories:robotics","task_categories:video-classification","language:en","license:cc-by-nc-sa-4.0","size_categories:n<1K","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","temporal-segmentation","subtask-annotation","robotics","robot-learning","egocentric-video","droid","robocoin","galaxea","homer"],"createdAt":"2026-06-21T14:56:34.000Z","key":""},{"_id":"645e8da96320b0efe40ade7a","id":"roneneldan/TinyStories","author":"roneneldan","disabled":false,"gated":false,"lastModified":"2024-08-12T13:27:26.000Z","likes":1043,"trendingScore":6,"private":false,"sha":"f54c09fd23315a6f9c86f9dc80f725de7d8f9c64","description":"Dataset containing synthetically generated (by GPT-3.5 and GPT-4) short stories that only use a small vocabulary.\nDescribed in the following paper: https://arxiv.org/abs/2305.07759. \nThe models referred to in the paper were trained on TinyStories-train.txt  (the file tinystories-valid.txt can be used for validation loss). 
These models can be found on Huggingface, at roneneldan/TinyStories-1M/3M/8M/28M/33M/1Layer-21M.\nAdditional resources:\ntinystories_all_data.tar.gz - contains a superset of… See the full description on the dataset page: https://huggingface.co/datasets/roneneldan/TinyStories.","downloads":78878,"tags":["task_categories:text-generation","language:en","license:cdla-sharing-1.0","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2305.07759","region:us"],"createdAt":"2023-05-12T19:04:09.000Z","key":""},{"_id":"656523d6bfb751371817c448","id":"Idavidrein/gpqa","author":"Idavidrein","disabled":false,"gated":"auto","lastModified":"2026-03-05T23:06:58.000Z","likes":471,"trendingScore":6,"private":false,"sha":"633f5ee89ab8ad4522a9f850766b73f62147ffdd","description":"\n\t\n\t\t\n\t\tDataset Card for GPQA\n\t\n\n\n\nGPQA is a multiple-choice, Q&A dataset of very hard questions written and validated by experts in biology, physics, and chemistry. 
When attempting questions out of their own domain (e.g., a physicist answers a chemistry question), these experts get only 34% accuracy, despite spending >30m with full access to Google.\nWe request that you do not reveal examples from this dataset in plain text or images online, to reduce the risk of leakage into foundation model… See the full description on the dataset page: https://huggingface.co/datasets/Idavidrein/gpqa.","downloads":94396,"tags":["benchmark:official","benchmark:eval-yaml","task_categories:question-answering","task_categories:text-generation","language:en","license:cc-by-4.0","size_categories:1K<n<10K","format:csv","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2311.12022","region:us","open-domain-qa","open-book-qa","multiple-choice-qa"],"createdAt":"2023-11-27T23:18:46.000Z","key":""},{"_id":"6655eb19d17e141dcb546ed5","id":"HuggingFaceFW/fineweb-edu","author":"HuggingFaceFW","disabled":false,"gated":false,"lastModified":"2025-07-11T20:16:53.000Z","likes":1167,"trendingScore":6,"private":false,"sha":"87f09149ef4734204d70ed1d046ddc9ca3f2b8f9","description":"\n\t\n\t\t\n\t\n\t\n\t\t📚 FineWeb-Edu\n\t\n\n\n    \n\n\n\n1.3 trillion tokens of the finest educational data the 🌐 web has to offer\n\nPaper: https://arxiv.org/abs/2406.17557\n\n\t\n\t\t\n\t\n\t\n\t\tWhat is it?\n\t\n\n📚 FineWeb-Edu  dataset consists of 1.3T tokens  and  5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from 🍷 FineWeb dataset. This is the 1.3 trillion version.\nTo enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by LLama3-70B-Instruct. 
We… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.","downloads":381568,"tags":["task_categories:text-generation","language:en","license:odc-by","size_categories:1B<n<10B","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2406.17557","arxiv:2404.14219","arxiv:2401.10020","arxiv:2109.07445","doi:10.57967/hf/2497","region:us"],"createdAt":"2024-05-28T14:32:57.000Z","key":""},{"_id":"6791fcbb49c4df6d798ca7c9","id":"cais/hle","author":"cais","disabled":false,"gated":"auto","lastModified":"2026-01-20T22:42:17.000Z","likes":847,"trendingScore":6,"private":false,"sha":"5a81a4c7271a2a2a312b9a690f0c2fde837e4c29","description":"\n\n\n[!NOTE]\nIMPORTANT: Please help us protect the integrity of this benchmark by not publicly sharing, re-uploading, or distributing the dataset.\n\n\n\t\n\t\t\n\t\tHumanity's Last Exam\n\t\n\n🌐 Website | 📄 Paper |  GitHub\nCenter for AI Safety & Scale AI\n\nHumanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. 
Humanity's Last Exam consists of 2,500 questions across dozens of… See the full description on the dataset page: https://huggingface.co/datasets/cais/hle.","downloads":27678,"tags":["benchmark:official","license:mit","size_categories:1K<n<10K","format:parquet","modality:image","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2025-01-23T08:24:27.000Z","key":""},{"_id":"67a404bc8c6d42c5ec097433","id":"Anthropic/EconomicIndex","author":"Anthropic","disabled":false,"gated":false,"lastModified":"2026-06-26T23:21:00.000Z","likes":552,"trendingScore":6,"private":false,"sha":"2ea58ff75e4247d26810c37f10c179edc2466cac","description":"\n\t\n\t\t\n\t\n\t\n\t\tThe Anthropic Economic Index\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tOverview\n\t\n\nThe Anthropic Economic Index provides insights into how AI is being incorporated into real-world tasks across the modern economy.\n\n\t\n\t\t\n\t\n\t\n\t\tData Releases\n\t\n\nThis repository contains multiple data releases, each with its own documentation:\n\nLabor market impacts: Job exposure and task penetration data\n2026-06-26 Release: Updated analysis with Artifacts and monthly aggregates\n2026-03-24 Release: Updated analysis with Opus… See the full description on the dataset page: https://huggingface.co/datasets/Anthropic/EconomicIndex.","downloads":22584,"tags":["language:en","license:mit","arxiv:2503.04761","region:us","AI","LLM","Economic Impacts","Anthropic"],"createdAt":"2025-02-06T00:39:24.000Z","key":""},{"_id":"699f870735497465190f84b5","id":"VINAY-UMRETHE/Sonnet-Opus-4.5-4.6-Gemini-3.0-3.1-Pro-GPT-5-5.1-5.2-GLM-4.7-MiniMax-M2.1-DeepSeek-V3.2-High","author":"VINAY-UMRETHE","disabled":false,"gated":"auto","lastModified":"2026-06-27T07:26:51.000Z","likes":11,"trendingScore":6,"private":false,"sha":"91250376a19118ac938af11b1a7c858ad44ad8a5","description":"\n\t\n\t\t\n\t\n\t\n\t\tDistill\n\t\n\n\n  \n  \n\n\nThis is a multi-source curated instruction and reasoning 
dataset specifically for training and distilling large language models (LLMs) to exhibit advanced Chain-of-Thought (CoT), Agentic, Mathematical and Coding capabilities. It aggregates high-quality outputs from frontier models into messages ChatML format.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Structure\n\t\n\nThe dataset contains a total of 70.2K examples, split into three subsets based on the presence of visible reasoning… See the full description on the dataset page: https://huggingface.co/datasets/VINAY-UMRETHE/Sonnet-Opus-4.5-4.6-Gemini-3.0-3.1-Pro-GPT-5-5.1-5.2-GLM-4.7-MiniMax-M2.1-DeepSeek-V3.2-High.","downloads":199,"tags":["task_categories:text-generation","language:en","license:apache-2.0","size_categories:100K<n<1M","format:parquet","format:optimized-parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us","distillation","agent","code","math","reasoning","non_reasoning","synthetic"],"createdAt":"2026-02-25T23:34:31.000Z","key":""},{"_id":"6a2ec3b4d8c60039982c056b","id":"kelexine/fable-5-sft-traces","author":"kelexine","disabled":false,"gated":false,"lastModified":"2026-06-15T05:42:49.000Z","likes":9,"trendingScore":6,"private":false,"sha":"c0608fa3bdff639e0648fa82adcca59b131e3226","description":"\n\t\n\t\t\n\t\n\t\n\t\tFable-5 SFT Traces\n\t\n\n\nAuthor / maintainer: kelexine (github.com/kelexine)\n\nA cleaned, anonymised, schema-normalised derivative of\nKelexine/Fable-5-traces\n— agentic traces from Fable-5 (claude-fable-5), the model now publicly\nknown as Claude Mythos — Anthropic's top-of-family frontier model at time\nof collection.\nThe dataset supports three fine-tuning shapes off a single JSONL with no\npreprocessing required:\n\n\t\n\t\t\nMode\nFields used\n\n\n\t\t\nFull SFT (thinking + response)\nmessages or… See the full description on the dataset page: 
https://huggingface.co/datasets/kelexine/fable-5-sft-traces.","downloads":542,"tags":["task_categories:text-generation","language:en","license:agpl-3.0","size_categories:1K<n<10K","format:parquet","format:optimized-parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","agentic","reasoning","chain-of-thought","function-calling","tool-use","tool-calling","distillation","claude","code"],"createdAt":"2026-06-14T15:07:32.000Z","key":""},{"_id":"6a303f5537d076ceefbc1821","id":"cfahlgren1/Fable-5-traces","author":"cfahlgren1","disabled":false,"gated":false,"lastModified":"2026-06-15T19:09:24.000Z","likes":15,"trendingScore":6,"private":false,"sha":"0ba6f53852f296f8389290b112054b47cec2dc1f","description":"A simple dataset of the raw Fable 5 Claude session logs we could get our hands on before it was taken away (no clue if it's coming back).\nThe raw trace files live in sessions/*.jsonl. Cache files, paste-cache files, shell history, and merged COT training exports are intentionally omitted so Hugging Face Datasets can load the repo through the agent-traces path.\n\n\t\n\t\t\n\t\n\t\n\t\tA pretty viewer for dataset:… See the full description on the dataset page: https://huggingface.co/datasets/cfahlgren1/Fable-5-traces.","downloads":1285,"tags":["license:agpl-3.0","size_categories:n<1K","format:json","format:agent-traces","modality:tabular","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-06-15T18:07:17.000Z","key":""},{"_id":"6a3197aa0c51fd401abb1814","id":"TuringEnterprises/Rubric-Graded-Reasoning","author":"TuringEnterprises","disabled":false,"gated":false,"lastModified":"2026-06-16T18:50:57.000Z","likes":12,"trendingScore":6,"private":false,"sha":"bc31dbd529592b64ec0de71fea3c89956fec66cc","description":"\n\t\n\t\t\n\t\n\t\n\t\tRubrics-Graded Reasoning — Computer Science, Data Science, Chemistry\n\t\n\nA multi-domain 
reasoning dataset built to improve frontier models by revealing their failures and turning expert grading into training signal.\nThe dataset pairs self-contained tasks with weighted rubrics across three domains — Computer Science, Data Science, and Chemistry — turning expert evaluation into training signals that boost frontier-model reasoning.\nExplore the full Rubric-based reasoning data pack:… See the full description on the dataset page: https://huggingface.co/datasets/TuringEnterprises/Rubric-Graded-Reasoning.","downloads":469,"tags":["task_categories:question-answering","task_categories:text-generation","language:en","license:mit","size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2009.03300","arxiv:2311.12022","region:us","rl","rubric-evaluation","rubrics-graded","llm-eval","computer-science","data-science","chemistry","scientific-reasoning"],"createdAt":"2026-06-16T18:36:26.000Z","key":""},{"_id":"6a347406d390563cd60d032d","id":"allenai/tmax-15k-open-instruct","author":"allenai","disabled":false,"gated":false,"lastModified":"2026-06-26T23:11:46.000Z","likes":8,"trendingScore":6,"private":false,"sha":"7b090eca98bf351356bc1c64290c5c4a09f2f98c","description":"\n\n  💻 Code ·\n  🤗 Models & Data ·\n  📜 Paper ·\n  📓 Blog\n\n\n\n[!NOTE]\nFor full information, go check out the Tmax paper here.\n\n\n\t\n\t\t\n\t\n\t\n\t\tTMax 15k - Open Instruct\n\t\n\nThis is the dataset we used to train Tmax 9b (and our other tmax models), formatted for use with our open-instruct fork here.\nIn general, this is a collection of roughly 15k RL environment instances.\nFor details on how we generated this dataset and its makeup, please see our paper!\nYou can find a more generic version of this… See the full description on the dataset page: 
https://huggingface.co/datasets/allenai/tmax-15k-open-instruct.","downloads":563,"tags":["language:en","license:odc-by","size_categories:10K<n<100K","format:parquet","format:optimized-parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2606.23321","region:us"],"createdAt":"2026-06-18T22:41:10.000Z","key":""},{"_id":"6a36673934f0213d9c1e6704","id":"ClSu/ember-features","author":"ClSu","disabled":false,"gated":false,"lastModified":"2026-06-20T12:01:48.000Z","likes":8,"trendingScore":6,"private":false,"sha":"9c08d5ee46efb5e66c7d86640d6e03ed1742d50f","description":"\n\t\n\t\t\n\t\n\t\n\t\tEMBER precomputed features\n\t\n\nConcept features for EMBedding ERasure (EMBER), a plug-and-play module that uses\nSparse Matrix Factorization to precisely erase concept-related features from token\nembeddings, making existing erasure methods more robust to relearning.\nFor each concept, two factorizations are provided:\n\nEmbedding features (EMBER): a sparse factorization of the token-embedding matrix.\nMLP features (SNMF): Semi-NMF over MLP activations.\n\nModels: google/gemma-2-2b-it (rank… See the full description on the dataset page: https://huggingface.co/datasets/ClSu/ember-features.","downloads":919,"tags":["license:mit","size_categories:1K<n<10K","format:csv","modality:tabular","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2606.03695","region:us"],"createdAt":"2026-06-20T10:11:05.000Z","key":""},{"_id":"6a3871555ae47cd0bf948930","id":"FrontisAI/NatureBench","author":"FrontisAI","disabled":false,"gated":false,"lastModified":"2026-06-25T08:26:20.000Z","likes":6,"trendingScore":6,"private":false,"sha":"ee848844b68de031fad462ccdf9a8f579e5e4bd6","description":"\n\t\n\t\t\n\t\n\t\n\t\tDataset Card for NatureBench\n\t\n\n\n\nNatureBench is a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, spanning 6 scientific domains. 
It is designed to evaluate whether AI coding agents can move beyond reproduction toward discovery: each task asks an agent to solve a real scientific machine-learning problem and is scored against the source paper's reported state of the art.\n\n📄 arXiv paper: https://arxiv.org/abs/2606.24530\n💻 GitHub code… See the full description on the dataset page: https://huggingface.co/datasets/FrontisAI/NatureBench.","downloads":18557,"tags":["language:en","license:other","size_categories:n<1K","format:json","modality:image","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2606.24530","region:us","coding-agents","benchmark","scientific-machine-learning","nature"],"createdAt":"2026-06-21T23:18:45.000Z","key":""},{"_id":"6a3a9fd3f6664a1b13e59781","id":"AstroAutomata/ThousandWorlds","author":"AstroAutomata","disabled":false,"gated":false,"lastModified":"2026-06-23T15:01:40.000Z","likes":6,"trendingScore":6,"private":false,"sha":"8493801b64b2ffef2717e4042a1dc61002003543","description":"\n\t\n\t\t\n\t\n\t\n\t\tThousandWorlds\n\t\n\n\n\nThousandWorlds is a benchmark for emulating exoplanet climates: 1760\nsimulations across 5 GCMs, 8 planet parameters, and atmospheric\nvariables on a 32 x 64 x 10 latitude-longitude-pressure grid. It includes three\nnested benchmark subsets, two evaluation protocols, and eight released baseline\nmethods.\n\n\nInputs are 8 continuous planet parameters plus the source GCM label. 
Outputs\nare time-averaged climate fields on a 32 x 64 latitude-longitude grid:… See the full description on the dataset page: https://huggingface.co/datasets/AstroAutomata/ThousandWorlds.","downloads":296,"tags":["task_categories:tabular-regression","task_categories:other","license:cc-by-4.0","size_categories:1K<n<10K","library:datasets","arxiv:2606.18338","region:us","benchmark","datasets","physical-sciences","scientific-machine-learning","exoplanets","climate","astronomy","emulation","simulation","physics","pde","parameter-to-field-regression","structured-outputs","multi-simulator-transfer","spatiotemporal"],"createdAt":"2026-06-23T15:01:39.000Z","key":""},{"_id":"6a3b58f031ecca7eacf9ec84","id":"KlingTeam/UnityShotsBench","author":"KlingTeam","disabled":false,"gated":false,"lastModified":"2026-06-24T12:45:27.000Z","likes":6,"trendingScore":6,"private":false,"sha":"90b6bbfb1961f665a13c51640f58b20eef57d953","description":"\n\t\n\t\t\n\t\n\t\n\t\tUnityShots Benchmark\n\t\n\nA multilingual, multi-cultural k-shot storytelling benchmark for evaluating multi-shot\naudio-video generation. 
Each case is a short cinematic story told across several shots, with a\nconsistent cast whose identity, voice, and world must persist across every cut.\nThis is the evaluation benchmark released with UnityShots: Memory-Driven Multi-Shot\nAudio-Video Generation with Boundary-Aware Gating.\n\n📄 Paper: arXiv:2606.21661\n🌐 Project page:… See the full description on the dataset page: https://huggingface.co/datasets/KlingTeam/UnityShotsBench.","downloads":1389,"tags":["task_categories:text-to-video","task_categories:image-to-video","language:zh","language:yue","language:en","language:de","language:es","language:ar","language:hi","language:bn","language:sw","language:yo","language:fa","language:pt","language:vi","license:cc-by-nc-4.0","size_categories:1K<n<10K","format:imagefolder","modality:audio","modality:image","library:datasets","library:mlcroissant","arxiv:2606.21661","region:us","multi-shot","audio-video-generation","storytelling","video-generation","talking-head","multilingual"],"createdAt":"2026-06-24T04:11:28.000Z","key":""},{"_id":"63990f21cc50af73d29ecfa3","id":"fka/prompts.chat","author":"fka","disabled":false,"gated":false,"lastModified":"2026-06-30T04:30:23.000Z","likes":9749,"trendingScore":5,"private":false,"sha":"f06e4dd97672b38ca20ac025a3e3d7b2ff823dd4","description":"\n  \n  \n  a.k.a. Awesome ChatGPT Prompts\n\n\nThis is a Dataset Repository mirror of prompts.chat — a social platform for AI prompts.\n\n\t\n\t\t\n\t\n\t\n\t\t📢 Notice\n\t\n\nThis Hugging Face dataset is a mirror. For the latest prompts, features, and community contributions, please visit:\n\n🌐 Website: prompts.chat\n📦 GitHub: github.com/f/awesome-chatgpt-prompts\n\n\n\t\n\t\t\n\t\n\t\n\t\tAbout\n\t\n\nprompts.chat is an open-source platform where users can share, discover, and collect AI prompts from the community. 
The project can… See the full description on the dataset page: https://huggingface.co/datasets/fka/prompts.chat.","downloads":31188,"tags":["task_categories:question-answering","task_categories:text-generation","license:cc0-1.0","size_categories:1K<n<10K","format:csv","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","ChatGPT","prompts","AI","GPT","Claude","Gemini","Llama","Mistral","LLM","prompt-engineering","conversational-ai","text-generation","chatbot","awesome-list"],"createdAt":"2022-12-13T23:47:45.000Z","key":""},{"_id":"6532270e829e1dc2f293d6b8","id":"gaia-benchmark/GAIA","author":"gaia-benchmark","disabled":false,"gated":"auto","lastModified":"2025-10-28T14:44:54.000Z","likes":706,"trendingScore":5,"private":false,"sha":"682dd723ee1e1697e00360edccf2366dc8418dd9","description":"\n\t\n\t\t\n\t\n\t\n\t\tGAIA dataset\n\t\n\nGAIA is a benchmark which aims at evaluating next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc).\nWe added gating to prevent bots from scraping the dataset. 
Please do not reshare the validation or test set in a crawlable format.\n\n\t\n\t\t\n\t\n\t\n\t\tData and leaderboard\n\t\n\nGAIA is made of more than 450 non-trivial question with an unambiguous answer, requiring different levels of tooling and autonomy to… See the full description on the dataset page: https://huggingface.co/datasets/gaia-benchmark/GAIA.","downloads":26931,"tags":["language:en","size_categories:n<1K","format:parquet","modality:audio","modality:document","modality:image","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2311.12983","region:us"],"createdAt":"2023-10-20T07:06:54.000Z","key":""},{"_id":"66561c5d5b8ab1ed4f7a21af","id":"mlabonne/harmful_behaviors","author":"mlabonne","disabled":false,"gated":false,"lastModified":"2024-06-04T10:45:47.000Z","likes":135,"trendingScore":5,"private":false,"sha":"01cead01398926d81f7c52bdb790ee8cf77ebba7","downloads":16330,"tags":["language:en","size_categories:n<1K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-05-28T18:03:09.000Z","key":""},{"_id":"68ae11cd78570b7e4c66edba","id":"ScaleAI/SWE-bench_Pro","author":"ScaleAI","disabled":false,"gated":false,"lastModified":"2026-02-23T20:54:47.000Z","likes":142,"trendingScore":5,"private":false,"sha":"7ab5114912baf22bb098818e604c02fe7ad2c11f","description":"\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nSWE-Bench Pro is a challenging, enterprise-level dataset for testing agent ability on long-horizon software engineering tasks.\nPaper: https://static.scale.com/uploads/654197dc94d34f66c0f5184e/SWEAP_Eval_Scale%20(9).pdf\nSee the related evaluation Github: https://github.com/scaleapi/SWE-bench_Pro-os\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Structure\n\t\n\nWe follow SWE-Bench Verified (https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified) in terms of dataset structure, with several… See the full description on the dataset page: 
https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro.","downloads":67614,"tags":["benchmark:official","benchmark:eval-yaml","size_categories:n<1K","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2025-08-26T19:58:05.000Z","key":""},{"_id":"69b0d60653fc22f5d5d3c07f","id":"agibot-world/AgiBotWorld2026","author":"agibot-world","disabled":false,"gated":false,"lastModified":"2026-05-27T10:14:56.000Z","likes":46,"trendingScore":5,"private":false,"sha":"6acbe225045ca2f3a4817261ab54638e1c155aa3","description":"\n\n\n\t\n\t\t\n\t\tAgiBot World 2026\n\t\n\nReal-World Embodied Intelligence Dataset\n\n\n\n\n\n\n\n\n\n\n\n\n\t\n\t\n\t\n\t\tOverview\n\t\n\nAs robotics research advances into real-world scenarios, the demand for authentic, high-quality data has become increasingly urgent. Following AGIBOT WORLD's \"ImageNet moment,\" we now release the AGIBOT WORLD 2026 dataset. Built upon massive real-world scenes, it systematically spans pivotal research directions in embodied intelligence, designed to power the next generation of embodied… See the full description on the dataset page: https://huggingface.co/datasets/agibot-world/AgiBotWorld2026.","downloads":36624,"tags":["task_categories:robotics","language:en","license:cc-by-nc-sa-4.0","size_categories:1K<n<10K","modality:image","modality:text","region:us","agibot","imitation-learning","embodied-ai","lerobot","real-world","dual-arm"],"createdAt":"2026-03-11T02:40:06.000Z","key":""},{"_id":"69ca9b695a4dac480491fd13","id":"lambda/hermes-agent-reasoning-traces","author":"lambda","disabled":false,"gated":false,"lastModified":"2026-04-17T10:06:39.000Z","likes":370,"trendingScore":5,"private":false,"sha":"b92885e4f0161d4b2536512710e004d4892cac6e","description":"\n\t\n\t\t\n\t\tHermes Agent Reasoning Traces\n\t\n\nMulti-turn tool-calling trajectories for training AI agents using the Hermes Agent harness. 
Each sample is a real agent conversation with step-by-step reasoning (<think> blocks) and actual tool execution results.\nThis dataset has two configs, one per source model:\n\n\t\n\t\t\nConfig\nModel\nSamples\n\n\n\t\t\nkimi\nMoonshot AI Kimi-K2.5\n7,646\n\n\nglm-5.1\nZhipuAI GLM-5.1-FP8\n7,055\n\n\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tLoading\n\t\n\nfrom datasets import load_dataset\n\n# Kimi-K2.5 traces\nds =… See the full description on the dataset page: https://huggingface.co/datasets/lambda/hermes-agent-reasoning-traces.","downloads":2997,"tags":["task_categories:text-generation","language:en","license:apache-2.0","size_categories:10K<n<100K","format:parquet","format:optimized-parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","tool-calling","function-calling","agent","hermes","reasoning","sharegpt","sft","traces"],"createdAt":"2026-03-30T15:48:57.000Z","key":""},{"_id":"69d4b18e5a6a59b86c749173","id":"Cseti/ComfyUI-Workflows","author":"Cseti","disabled":false,"gated":false,"lastModified":"2026-06-24T11:10:01.000Z","likes":19,"trendingScore":5,"private":false,"sha":"193b4adb627e695de347253d692698f78000f62f","description":"\n\t\n\t\t\n\t\n\t\n\t\tComfyUI Workflows\n\t\n\nA personal collection of ComfyUI workflows for AI video generation, upscaling, and video editing.\nEach workflow entry includes a description, example output, requirements, and usage notes.\nDrag-and-drop any .json file directly into ComfyUI to load the workflow.\n\n\n\t\n\t\t\n\t\n\t\n\t\tPosts\n\t\n\n\n\t\n\t\t\nPost\nDate\n\n\n\t\t\nMOMENTS — Making of\n2026-03-30\n\n\n\t\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tLTX Video 2.3\n\t\n\n\n\t\n\t\t\nWorkflow\nDescription\nType\nUpdated\n\n\n\t\t\nIC-LoRA Cameraman v2\nI2V with IC-LoRA camera… See the full description on the dataset page: 
https://huggingface.co/datasets/Cseti/ComfyUI-Workflows.","downloads":1453,"tags":["region:us"],"createdAt":"2026-04-07T07:26:06.000Z","key":""},{"_id":"6a1aae5368425ef2b92ee346","id":"TrueNix/ctf-solver-dataset","author":"TrueNix","disabled":false,"gated":false,"lastModified":"2026-05-30T10:52:05.000Z","likes":7,"trendingScore":5,"private":false,"sha":"a7afd66b69800ed2f8b7f40dea0bec9a11dbfea2","downloads":279,"tags":["size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-05-30T09:30:59.000Z","key":""},{"_id":"6a1d8b0a3c0febf8a426c414","id":"doctolib-lab/finemed-fr","author":"doctolib-lab","disabled":false,"gated":false,"lastModified":"2026-06-23T06:51:44.000Z","likes":5,"trendingScore":5,"private":false,"sha":"44e02c66ee930786534bfbe091f95b02537d51e0","description":"\n\t\n\t\t\n\t\n\t\n\t\tFineMed-fr\n\t\n\n\n  \n\n\n\n🤗 Blog |\n📄 Paper |\n💻 Code |\n🌐 FineMed |\n🩺 DoctoBERT\n\n\n\n\t\n\t\t\n\t\n\t\n\t\t📚 Introduction\n\t\n\nFineMed-fr is a large, openly available corpus of French medical text for language-model pretraining: 21.1M documents and 19.2B words of real-world medical writing, annotated along several quality axes.\nThe corpus is drawn from three heterogeneous open-web sources (FineWeb-2,\nFinePDFs, and\nFineWiki), which together provide the scale, source\ndiversity, and stylistic range… See the full description on the dataset page: 
https://huggingface.co/datasets/doctolib-lab/finemed-fr.","downloads":1632,"tags":["task_categories:fill-mask","task_categories:text-generation","language:fr","license:odc-by","license:cc-by-sa-4.0","size_categories:10M<n<100M","format:parquet","modality:image","modality:tabular","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2606.22079","region:us","medical","healthcare","biomedical","clinical","french","pretraining","data-filtering","fineweb-2","finepdfs","finewiki"],"createdAt":"2026-06-01T13:37:14.000Z","key":""},{"_id":"6a321ec1abfe225c52eeedb7","id":"Omarrran/Koshur_Pixel","author":"Omarrran","disabled":false,"gated":"manual","lastModified":"2026-06-23T04:53:37.000Z","likes":5,"trendingScore":5,"private":false,"sha":"874c4df69c3f244eed6cc018d8d8a9fbf09ce250","description":"\n\t\n\t\t\n\t\n\t\n\t\tKoshur Pixel\n\t\n\nKoshur Pixel is a synthetic optical character recognition (OCR) dataset for Kashmiri (ks), in which Unicode text is rendered to images using Nastaliq / Naksh-style fonts and paired with its exact transcription. 
It is built for OCR recognition, image-to-text modeling, fine-tuning, and evaluation workflows that need clean, controllable image/label pairs at scale.\nBecause the text is rendered programmatically, every image ships with a perfectly aligned ground-truth… See the full description on the dataset page: https://huggingface.co/datasets/Omarrran/Koshur_Pixel.","downloads":359,"tags":["task_categories:image-to-text","task_categories:text-to-image","language:ks","license:cc-by-nd-4.0","size_categories:1M<n<10M","modality:image","modality:text","arxiv:2606.23144","region:us","ocr","synthetic","kashmiri","nastaliq","nakash"],"createdAt":"2026-06-17T04:12:49.000Z","key":""},{"_id":"6a35605763a799204f741823","id":"jdopensource/JoyAI-VL-Interaction","author":"jdopensource","disabled":false,"gated":false,"lastModified":"2026-06-23T08:06:07.000Z","likes":11,"trendingScore":5,"private":false,"sha":"6cf9b8383e0fcfeecfe8c12d44499b01b931c98e","description":"\n\t\n\t\t\n\t\n\t\n\t\tJoyAI-VL-Interaction Dataset\n\t\n\nThe first open, vision-driven real-time interaction model — it watches a live video stream and decides on its own when to speak, stay silent, or delegate.\n📄 Paper · 🌐 Project Page & Demos · 💻 GitHub · 🤗 Paper Page\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Description\n\t\n\nThis repository contains the aligned interaction training data released as part of the JoyAI-VL-Interaction project. 
The dataset consists of over 4 million time-aligned video-language interaction… See the full description on the dataset page: https://huggingface.co/datasets/jdopensource/JoyAI-VL-Interaction.","downloads":762,"tags":["task_categories:video-text-to-text","license:apache-2.0","arxiv:2606.14777","region:us"],"createdAt":"2026-06-19T15:29:27.000Z","key":""},{"_id":"6a3649e9d999d2601908f1eb","id":"hotdogs/uka-fable-reasoning","author":"hotdogs","disabled":false,"gated":"manual","lastModified":"2026-06-30T05:45:58.000Z","likes":9,"trendingScore":5,"private":false,"sha":"53780c11a9379a6349ac353c7b6ae21b90881726","description":"\n\t\n\t\t\n\t\n\t\n\t\t🧠 uka-fable-reasoning v2.1\n\t\n\n\nCurated reasoning dataset for fine-tuning LLMs as autonomous agents.\n7,482 rows — ChatML format, multi-turn conversations, all quality issues fixed.\n\n\n\n\t\n\t\t\n\t\n\t\n\t\t📊 Dataset Overview\n\t\n\n\n\t\n\t\t\nSplit\nRows\nMessages\nSize (JSONL)\nSize (Parquet)\n\n\n\t\t\nTrain\n6,359\n3–137\n49.4 MB\n22.1 MB\n\n\nValid\n748\n3–10+\n5.8 MB\n2.5 MB\n\n\nTest\n375\n3–10+\n3.0 MB\n1.3 MB\n\n\nTotal\n7,482\n\n58.2 MB\n25.9 MB\n\n\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tConversation Types\n\t\n\n\n\t\n\t\t\nType\nCount\nDescription… See the full description on the dataset page: https://huggingface.co/datasets/hotdogs/uka-fable-reasoning.","downloads":227,"tags":["language:en","license:agpl-3.0","size_categories:10K<n<100K","modality:text","region:us","reasoning","agentic","sft","chain-of-thought","multi-turn","tool-use","chatml"],"createdAt":"2026-06-20T08:06:01.000Z","key":""},{"_id":"6a3d72c815fc3044e3ef80c0","id":"nvidia/Cosmos3-DROID","author":"nvidia","disabled":false,"gated":false,"lastModified":"2026-06-25T18:35:34.000Z","likes":5,"trendingScore":5,"private":false,"sha":"5c11a20accb11497270a5247a7f1e66ad04c956c","description":"\n\t\n\t\t\n\t\n\t\n\t\tDROID: Distributed Robot Interaction Dataset\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Summary\n\t\n\nDROID (Distributed Robot Interaction 
Dataset) is a large-scale \"in-the-wild\" robot manipulation dataset containing 76K teleoperated demonstration trajectories — approximately 350 hours of interaction data — collected across 564 unique scenes, 86 tasks, and 52 buildings over the course of 12 months. The data was collected by 50 data collectors at 18 labs across 13 institutions in North America, Asia, and… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Cosmos3-DROID.","downloads":12660,"tags":["license:openmdw-1.1","size_categories:1K<n<10K","modality:video","library:datasets","library:mlcroissant","arxiv:2403.12945","region:us"],"createdAt":"2026-06-25T18:26:16.000Z","key":""},{"_id":"621ffdd236468d709f181e5e","id":"cais/mmlu","author":"cais","disabled":false,"gated":false,"lastModified":"2024-03-08T20:36:26.000Z","likes":778,"trendingScore":4,"private":false,"sha":"c30699e8356da336a370243923dbaf21066bb9fe","description":"\n\t\n\t\t\n\t\tDataset Card for MMLU\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nMeasuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021).\nThis is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. 
This covers 57 tasks… See the full description on the dataset page: https://huggingface.co/datasets/cais/mmlu.","downloads":428700,"paperswithcode_id":"mmlu","tags":["task_categories:question-answering","task_ids:multiple-choice-qa","annotations_creators:no-annotation","language_creators:expert-generated","multilinguality:monolingual","source_datasets:original","language:en","license:mit","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2009.03300","arxiv:2005.00700","arxiv:2005.14165","arxiv:2008.02275","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"640f5b2fb63b6f18522d6d44","id":"tatsu-lab/alpaca","author":"tatsu-lab","disabled":false,"gated":false,"lastModified":"2023-05-22T20:33:36.000Z","likes":999,"trendingScore":4,"private":false,"sha":"dce01c9b08f87459cf36a430d809084718273017","description":"\n\t\n\t\t\n\t\tDataset Card for Alpaca\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nAlpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. 
This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instruction better.\nThe authors built on the data generation pipeline from Self-Instruct framework and made the following modifications:\n\nThe text-davinci-003 engine to generate the instruction data instead… See the full description on the dataset page: https://huggingface.co/datasets/tatsu-lab/alpaca.","downloads":73964,"tags":["task_categories:text-generation","language:en","license:cc-by-nc-4.0","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","instruction-finetuning"],"createdAt":"2023-03-13T17:19:43.000Z","key":""},{"_id":"64382440c212a363c3ac15c8","id":"OpenAssistant/oasst1","author":"OpenAssistant","disabled":false,"gated":false,"lastModified":"2023-05-02T13:21:21.000Z","likes":1539,"trendingScore":4,"private":false,"sha":"fdf72ae0827c1cda404aff25b6603abec9e3399b","description":"\n\t\n\t\t\n\t\tOpenAssistant Conversations Dataset (OASST1)\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nIn an effort to democratize research on large-scale alignment, we release OpenAssistant \nConversations (OASST1), a human-generated, human-annotated assistant-style conversation \ncorpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 \nquality ratings, resulting in over 10,000 fully annotated conversation trees. 
The corpus \nis a product of a worldwide crowd-sourcing effort… See the full description on the dataset page: https://huggingface.co/datasets/OpenAssistant/oasst1.","downloads":15529,"tags":["language:en","language:es","language:ru","language:de","language:pl","language:th","language:vi","language:sv","language:bn","language:da","language:he","language:it","language:fa","language:sk","language:id","language:nb","language:el","language:nl","language:hu","language:eu","language:zh","language:eo","language:ja","language:ca","language:cs","language:bg","language:fi","language:pt","language:tr","language:ro","language:ar","language:uk","language:gl","language:fr","language:ko","license:apache-2.0","size_categories:10K<n<100K","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2304.07327","region:us","human-feedback"],"createdAt":"2023-04-13T15:48:16.000Z","key":""},{"_id":"653657e07139c5dd8dc15650","id":"jxu124/OpenX-Embodiment","author":"jxu124","disabled":false,"gated":false,"lastModified":"2024-10-16T07:25:56.000Z","likes":114,"trendingScore":4,"private":false,"sha":"3cde9e791b4f9d3ad93efcdd1e47747f2c37edd1","description":"\n\t\n\t\t\n\t\tOpen X-Embodiment Dataset (unofficial)\n\t\n\nThis is an unofficial Dataset Repo. This Repo is set up to make Open X-Embodiment Dataset (55 in 1) more accessible for people who love huggingface🤗.\nOpen X-Embodiment Dataset is the largest open-source real robot dataset to date. 
It contains 1M+ real robot trajectories spanning 22 robot embodiments, from single robot arms to bi-manual robots and quadrupeds.\nMore information is located on RT-X website… See the full description on the dataset page: https://huggingface.co/datasets/jxu124/OpenX-Embodiment.","downloads":18519,"tags":["task_categories:robotics","task_categories:reinforcement-learning","language:en","license:cc-by-4.0","size_categories:1M<n<10M","region:us","Robotics"],"createdAt":"2023-10-23T11:24:16.000Z","key":""},{"_id":"65d79d224f7ca8579b9e5e84","id":"MathLLMs/MathVision","author":"MathLLMs","disabled":false,"gated":false,"lastModified":"2026-06-10T07:04:16.000Z","likes":150,"trendingScore":4,"private":false,"sha":"2837ddb3f13abaf6b3997c12d80753e5470bd46a","description":"\n\t\n\t\t\n\t\n\t\n\t\tMeasuring Multimodal Mathematical Reasoning with the MATH-Vision Dataset\n\t\n\n[💻 Github] [🌐 Homepage]  [📊 Main Leaderboard ] [📊 Open Source Leaderboard ] [🌿 Wild Leaderboard ] [🔍 Visualization] [📖 Paper]\n\n\n\t\n\t\t\n\t\n\t\n\t\t🌿 NEW: MATH-Vision-Wild\n\t\n\nMATH-Vision-Wild is a photographic, real-world variant of MATH-Vision. 
The same testmini problems are physically captured on printed paper, iPads, laptops, and projectors under varying lighting and angles — the conditions VLMs actually… See the full description on the dataset page: https://huggingface.co/datasets/MathLLMs/MathVision.","downloads":11277,"tags":["task_categories:question-answering","task_categories:multiple-choice","task_categories:visual-question-answering","task_categories:text-generation","task_categories:image-to-text","task_categories:image-text-to-text","annotations_creators:expert-generated","annotations_creators:found","language_creators:expert-generated","language_creators:found","language:en","license:mit","size_categories:1K<n<10K","format:parquet","modality:image","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2501.12599","arxiv:2402.14804","region:us","mathematics","reasoning","multi-modal-qa","math-qa","figure-qa","geometry-qa","math-word-problem","textbook-qa","vqa","geometry-diagram","synthetic-scene","chart","plot","scientific-figure","table","function-plot","abstract-scene","puzzle-test","document-image","science"],"createdAt":"2024-02-22T19:14:42.000Z","key":""},{"_id":"65dc13085ca10be41fdd8b27","id":"bigcode/the-stack-v2","author":"bigcode","disabled":false,"gated":"auto","lastModified":"2024-04-23T15:52:32.000Z","likes":590,"trendingScore":4,"private":false,"sha":"7408bfbcfd48e5833d62fd3dba48afd20d109473","description":"\n\t\n\t\t\n\t\tThe Stack v2\n\t\n\n\n    \n\n\nThe dataset consists of 4 versions:\n\nbigcode/the-stack-v2: the full \"The Stack v2\" dataset <-- you are here\nbigcode/the-stack-v2-dedup: based on the bigcode/the-stack-v2 but further near-deduplicated\nbigcode/the-stack-v2-train-full-ids: based on the bigcode/the-stack-v2-dedup dataset but further filtered with heuristics and spanning 600+ programming languages. 
The data is grouped into repositories.bigcode/the-stack-v2-train-smol-ids: based on the… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-v2.","downloads":28836,"tags":["task_categories:text-generation","language_creators:crowdsourced","language_creators:expert-generated","multilinguality:multilingual","language:code","license:other","size_categories:1B<n<10B","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2402.19173","arxiv:2107.03374","arxiv:2207.14157","region:us"],"createdAt":"2024-02-26T04:26:48.000Z","key":""},{"_id":"67c940995477d45b871a8f1c","id":"ai4bharat/IndicVoices","author":"ai4bharat","disabled":false,"gated":"auto","lastModified":"2026-06-15T03:43:22.000Z","likes":71,"trendingScore":4,"private":false,"sha":"c96f9088f138cf89d419da7e8e643e1f05c00a87","description":"\n\t\n\t\t\n\t\n\t\n\t\tIndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages\n\t\n\n\n  \n  \n  \n\n\n\n\t\n\t\t\n\t\n\t\n\t\tUpdates\n\t\n\n\n[23 December 2025] We now have 11,200 hours of transcribed data! 🎉\n\n\n\t\n\t\t\n\t\n\t\n\t\tOverview\n\t\n\nINDICVOICES is a dataset of natural and spontaneous speech containing a total of 23.7K hours of read (8%), extempore (76%) and conversational (15%) audio from 51K speakers covering 400+ Indian districts and 22 languages. 
Of these 23.7K hours, 11.2K hours have… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/IndicVoices.","downloads":11220,"tags":["license:cc-by-4.0","size_categories:1M<n<10M","format:parquet","format:optimized-parquet","modality:audio","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2403.01926","region:us"],"createdAt":"2025-03-06T06:28:41.000Z","key":""},{"_id":"682600d8e6a0ae86702e3da9","id":"nvidia/Granary","author":"nvidia","disabled":false,"gated":false,"lastModified":"2026-06-26T01:41:30.000Z","likes":204,"trendingScore":4,"private":false,"sha":"0fe23a860e3570b111e79a05ebeefc53822161e2","description":"\n\t\n\t\t\n\t\n\t\n\t\tGranary: Speech Recognition and Translation Dataset in 25 European Languages\n\t\n\nGranary is a large-scale, open-source multilingual speech dataset covering 25 European languages for Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST) tasks. \n\n\n\n\t\n\t\t\n\n\n\n\n\t\t\n\n\n\n\n\t\n\n\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tOverview\n\t\n\nGranary addresses the scarcity of high-quality speech data for low-resource languages by consolidating multiple datasets under a unified framework:\n\n🗣️ ~1M hours of… See the full description on the dataset page: 
https://huggingface.co/datasets/nvidia/Granary.","downloads":3935,"tags":["task_categories:automatic-speech-recognition","task_categories:translation","language:bg","language:cs","language:da","language:de","language:el","language:en","language:es","language:et","language:fi","language:fr","language:hr","language:hu","language:it","language:lt","language:lv","language:mt","language:nl","language:pl","language:pt","language:ro","language:ru","language:sk","language:sl","language:sv","language:uk","license:cc-by-4.0","size_categories:100M<n<1B","format:json","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2406.00899","arxiv:2505.13404","region:us","granary","multilingual","nemo"],"createdAt":"2025-05-15T14:57:28.000Z","key":""},{"_id":"684857b1cf25261f166c09d9","id":"facebook/bouquet","author":"facebook","disabled":false,"gated":"auto","lastModified":"2026-05-28T12:50:46.000Z","likes":44,"trendingScore":4,"private":false,"sha":"9a6070a9652e350dda1d353c4fd198533199a911","description":"\n\t\n\t\t\n\t\n\t\n\t\tBOUQuET 💐: Benchmark and Open initiative for Universal Quality Evaluation in Translation\n\t\n\nBOUQuET is a multi-way parallel, multi-centric and multi-register/domain dataset and benchmark for machine translation quality.\nThe underlying texts have been handcrafted by linguists in 8 diverse languages (Egyptian Arabic, French, German, Hindi, Indonesian, Mandarin Chinese, Russian, and Spanish) and translated to English and 266 other languoids (language + script combinations). 
The… See the full description on the dataset page: https://huggingface.co/datasets/facebook/bouquet.","downloads":2123,"tags":["task_categories:translation","language_creators:expert-generated","language:aa","language:aar","language:abl","language:af","language:afr","language:agr","language:aiq","language:als","language:am","language:amh","language:ami","language:ane","language:apc","language:arh","language:arn","language:arz","language:as","language:asm","language:ayr","language:ayz","language:azb","language:azj","language:azm","language:azz","language:ba","language:bak","language:bam","language:bas","language:bba","language:be","language:bel","language:ben","language:bft","language:bg","language:bhb","language:bho","language:bm","language:bn","language:bo","language:bod","language:bos","language:br","language:bre","language:brh","language:brx","language:bs","language:bsh","language:bsk","language:bul","language:ca","language:cak","language:cat","language:ce","language:ceb","language:ces","language:che","language:chr","language:chv","language:cja","language:cjk","language:ckb","language:ckl","language:cmn","language:crk","language:cs","language:cux","language:cv","language:cy","language:cym","language:da","language:dan","language:daq","language:de","language:deu","language:dgo","language:dik","language:diq","language:div","language:djc","language:dje","language:dtm","language:dts","language:dua","language:dv","language:dz","language:dzo","language:ekk","language:el","language:ell","language:en","language:enb","language:eng","language:enl","language:es","language:eto","language:eu","language:eus","language:ewo","language:fao","language:fi","language:fia","language:fin","language:fo","language:fr","language:fra","language:fry","language:fuc","language:fuv","language:fvr","language:fy","language:ga","language:gax","language:gaz","language:gd","language:gil","language:gkp","language:gl","language:gla","language:gle","language:glg","language:gom","language:gu","language
:guc","language:gug","language:guj","language:guz","language:gxx","language:ha","language:hat","language:hau","language:he","language:heb","language:heh","language:hi","language:hin","language:hne","language:hr","language:hrv","language:ht","language:hu","language:hun","language:hve","language:hy","language:hye","language:ibo","language:id","language:ig","language:ijc","language:ilo","language:ind","language:irk","language:is","language:isl","language:it","language:ita","language:ja","language:jav","language:jmc","language:jnj","language:jpn","language:jv","language:ka","language:kaa","language:kac","language:kai","language:kal","language:kam","language:kan","language:kat","language:kaz","language:kea","language:kek","language:khk","language:khm","language:khq","language:khw","language:kin","language:kir","language:kk","language:kl","language:kls","language:km","language:kmb","language:kmr","language:kn","language:knc","language:knw","language:ko","language:kor","language:krt","language:kru","language:ksf","language:ktu","language:kuj","language:kwy","language:kxp","language:ky","language:lao","language:led","language:lg","language:lgg","language:li","language:lij","language:lim","language:lin","language:lir","language:lit","language:ln","language:lo","language:loa","language:loh","language:lt","language:lug","language:luo","language:lvs","language:maf","language:mai","language:mal","language:mam","language:mar","language:mas","language:mey","language:mi","language:mie","language:min","language:miq","language:mk","language:mkd","language:ml","language:mlt","language:mos","language:mr","language:mri","language:mt","language:mtq","language:my","language:mya","language:mzl","language:naq","language:nhe","language:nl","language:nld","language:nlv","language:nn","language:nno","language:npi","language:nso","language:nus","language:ny","language:nya","language:ory","language:pa","language:pan","language:pbs","language:pbt","language:pcm","language:pes","language:pl","lang
uage:plt","language:pol","language:por","language:pt","language:quc","language:quh","language:quz","language:rm","language:ro","language:rob","language:roh","language:ron","language:ru","language:rus","language:rw","language:sat","language:sba","language:scn","language:sd","language:se","language:sgc","language:shn","language:si","language:sif","language:sin","language:sk","language:skr","language:sl","language:slk","language:slv","language:sme","language:sn","language:sna","language:snd","language:so","language:som","language:sot","language:spa","language:sr","language:sro","language:srp","language:ss","language:ssw","language:st","language:su","language:sun","language:sv","language:swe","language:swh","language:szl","language:ta","language:tam","language:taq","language:tat","language:tda","language:te","language:tel","language:tg","language:tgk","language:tgl","language:th","language:tha","language:ti","language:tir","language:tl","language:tn","language:toc","language:tpi","language:tpl","language:tr","language:ts","language:tsg","language:tsn","language:tso","language:tsz","language:tt","language:tui","language:tur","language:tw","language:twi","language:tzh","language:tzm","language:ug","language:uig","language:uk","language:ukr","language:umb","language:ur","language:urd","language:uzn","language:ve","language:ven","language:vi","language:vie","language:vmw","language:war","language:wlv","language:wo","language:wol","language:wuu","language:xh","language:xho","language:xuu","language:ydd","language:ydg","language:yo","language:yor","language:yua","language:yue","language:zai","language:zne","language:zsm","language:zu","language:zul","license:cc-by-4.0","size_categories:1K<n<10K","arxiv:2502.04314","arxiv:2603.16309","arxiv:2205.08533","region:us"],"createdAt":"2025-06-10T16:05:05.000Z","key":""},{"_id":"68b427593de456247d6a9341","id":"thedeoxen/refcontrol-flux-kontext-dataset","author":"thedeoxen","disabled":false,"gated":false,"lastModified":"2025-09-01T09:4
0:26.000Z","likes":11,"trendingScore":4,"private":false,"sha":"d0071dc0d5a7f687650a92bb5997dc6d84f1eaf8","description":"\n\t\n\t\t\n\t\tFlux Kontext RefControl Dataset\n\t\n\nThis dataset was created for training Flux Kontext RefControl LoRAs.It provides paired data of control maps (depth, pose, lineart, canny) and their corresponding results for reference-guided training.\n\n\n\t\n\t\t\n\t\t📂 Dataset Structure\n\t\n\ndataset/\n│\n├── depth/\n│   ├── control/   # depth maps\n│   └── result/    # corresponding images\n│\n├── pose/\n│   ├── control/   # pose skeletons / keypoints\n│   └── result/    # corresponding images\n│\n├── lineart/\n│   ├──… See the full description on the dataset page: https://huggingface.co/datasets/thedeoxen/refcontrol-flux-kontext-dataset.","downloads":51,"tags":["license:apache-2.0","size_categories:1K<n<10K","format:imagefolder","modality:image","library:datasets","library:mlcroissant","region:us"],"createdAt":"2025-08-31T10:43:37.000Z","key":""},{"_id":"699d836b2b8317e9175e662d","id":"Forithmus/MR-RATE","author":"Forithmus","disabled":false,"gated":"auto","lastModified":"2026-04-23T16:39:07.000Z","likes":86,"trendingScore":4,"private":false,"sha":"ce3eaa38c93e8ef0520d226be7ad21a22efb78a5","description":"\n  \n    MR-RATE: A Vision-Language Foundation Model and Dataset for Magnetic Resonance Imaging\n  \n\n\n  \n    \n  \n  \n    \n  \n  \n    \n  \n  \n  \n    \n    \n  \n\n\nWelcome to the official page for MR-RATE, a pioneering vision-language model and 3D medical imaging dataset that pairs textual reports with brain and spine MRI volumes. 
Following the approach of CT-RATE, the first 3D medical imaging dataset to pair images with textual reports, MR-RATE offers brain and spine MRI volumes matched with… See the full description on the dataset page: https://huggingface.co/datasets/Forithmus/MR-RATE.","downloads":91750,"tags":["task_categories:image-to-text","task_categories:text-to-image","task_categories:image-classification","task_categories:question-answering","task_categories:visual-question-answering","task_categories:zero-shot-classification","language:en","license:cc-by-nc-sa-4.0","size_categories:10K<n<100K","region:us","brain-mri","radiology","science","huggingscience","3d-medical-imaging","medical","mr-rate","multimodal","vision-language","healthcare","diagnostic-imaging","computer-vision","foundation-model"],"createdAt":"2026-02-24T10:54:35.000Z","key":""},{"_id":"69ada35be33c0fe7d096f084","id":"nvidia/Nemotron-SFT-Agentic-v2","author":"nvidia","disabled":false,"gated":false,"lastModified":"2026-03-11T00:58:06.000Z","likes":38,"trendingScore":4,"private":false,"sha":"49e79a3be5ab8cf7511a12958b95cfd6408cd8db","description":"\n\t\n\t\t\n\t\tDataset Description\n\t\n\nThe Nemotron-SFT-Agentic-v2 dataset is a collection of synthetic single-turn and multi-turn tool-use trajectories designed to strengthen models’ capabilities as interactive, tool-using agents. 
It targets tasks where the model must decompose user goals, decide when to call tools, and reason over tool outputs to complete tasks reliably and safely.\nThis dataset is ready for commercial use.\nThe dataset consolidates three internally curated components (described below… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-SFT-Agentic-v2.","downloads":17071,"tags":["task_categories:text-generation","language:en","license:cc-by-4.0","license:apache-2.0","license:mit","region:us","tool-use"],"createdAt":"2026-03-08T16:27:07.000Z","key":""},{"_id":"69b0a69caab02f7aaec0e66f","id":"bones-studio/seed","author":"bones-studio","disabled":false,"gated":"auto","lastModified":"2026-05-03T15:03:12.000Z","likes":160,"trendingScore":4,"private":false,"sha":"2f59b2077b9da34dd4e43618e705c7cb962c9a66","description":"\n\n\n\t\n\t\t\n\t\tBONES-SEED: Skeletal Everyday Embodiment Dataset\n\t\n\nBONES-SEED is an open dataset of 142,220 annotated human motion animations for humanoid robotics. 
It provides motion capture data in SOMA and Unitree G1 formats, with natural language descriptions, temporal segmentation, and detailed skeletal metadata.\n\nProject website: bones.studio/datasets/seed\nInteractive viewer: seed-viewer.bones.studio\nAssociated code: github.com/bones-studio/seed-viewer\n\n\n\t\n\t\t\n\n\n\n\n\t\t\nTotal motions142,220 (71… See the full description on the dataset page: https://huggingface.co/datasets/bones-studio/seed.","downloads":4199,"tags":["task_categories:robotics","task_categories:text-to-video","task_categories:video-text-to-text","language:en","license:other","size_categories:100K<n<1M","region:us","motion-capture","humanoid-robotics","human-motion","physical-ai","whole-body-control","NVIDIA-SOMA","Unitree-G1","BVH","MuJoCo","language-to-action","locomotion","gesture","dance","object-interaction","multimodal","annotated"],"createdAt":"2026-03-10T23:17:48.000Z","key":""},{"_id":"69f8a99965865e88b8e7995b","id":"it4lia/PHANTOM","author":"it4lia","disabled":false,"gated":"auto","lastModified":"2026-06-24T08:25:55.000Z","likes":4,"trendingScore":4,"private":false,"sha":"756625342e13d836fa3c64bc67a382b454c86019","description":"\n\t\n\t\t\n\t\n\t\n\t\tPHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models\n\t\n\n\n  \n\n\n\nPHANTOM is a large-scale, open-source dataset of pre-generated multimodal adversarial attacks designed to evaluate the robustness and safety of Vision-Language Models (VLMs).\nThe dataset provides ready-to-use adversarial image–text pairs targeting a wide range of harmful intents, enabling systematic, reproducible, and cost-effective safety evaluation without requiring expensive… See the full description on the dataset page: 
https://huggingface.co/datasets/it4lia/PHANTOM.","downloads":246,"tags":["task_categories:image-text-to-text","task_categories:visual-question-answering","task_categories:image-to-text","annotations_creators:machine-generated","language_creators:machine-generated","language:en","license:cc-by-4.0","size_categories:10K<n<100K","format:imagefolder","modality:image","library:datasets","library:mlcroissant","region:us","adversarial-attacks","adversarial-robustness","jailbreak","red-teaming","ai-safety","vision-language-models","multimodal","safety-evaluation","benchmark"],"createdAt":"2026-05-04T14:13:45.000Z","key":""},{"_id":"6a0eb43154ff1b9068f42571","id":"openbmb/UltraData-SFT-2605","author":"openbmb","disabled":false,"gated":"auto","lastModified":"2026-05-28T17:18:14.000Z","likes":355,"trendingScore":4,"private":false,"sha":"affda6aca75e7cff78e73f93ad08d4c3b01f097c","description":"\n\t\n\t\t\n\t\n\t\n\t\tUltraData-SFT-2605\n\t\n\n\n  \n\n\n\n📦 UltraData Collection |\n🌐 UltraData | \n🤗 MiniCPM5 Series\n\n\n\nEnglish |\n中文\n\n\n\n\t\n\t\t\n\t\n\t\n\t\t📚 Introduction\n\t\n\nUltraData-SFT-2605 is the full set of core-domain SFT data used in the post-training of MiniCPM5-1B-SFT within the MiniCPM5-1B series, and a key representative of L3 refined data in the UltraData L0-L4 tiered data management framework. 
It covers math, code, knowledge, instruction following, and other core domains, containing over 15 million Deep… See the full description on the dataset page: https://huggingface.co/datasets/openbmb/UltraData-SFT-2605.","downloads":41959,"tags":["task_categories:text-generation","task_categories:question-answering","language:en","language:zh","license:apache-2.0","size_categories:10M<n<100M","modality:text","arxiv:2602.09003","region:us","llm","sft","supervised-fine-tuning","post-training","deep-thinking","reasoning","instruction-following","math","code","knowledge","minicpm"],"createdAt":"2026-05-21T07:28:49.000Z","key":""},{"_id":"6a1cbd0141aa598ff9f9bf57","id":"HelioAI/Fable-5-Distill-Reasoning-462x","author":"HelioAI","disabled":false,"gated":false,"lastModified":"2026-06-15T22:35:42.000Z","likes":34,"trendingScore":4,"private":false,"sha":"ab4e69b74e7ef455f15f23fc60bac891db90a918","description":"\n  \n    \n      \n      HelioAI&nbsp;Labs\n      Mythos V2 Full Distill\n    \n    \n      DeepReason 462×105M\n    \n    \n      Unrestricted full-parameter distillation from Mythos V2 — complete reasoning traces with zero alignment truncation, engineered for deep analytical research and process supervision.\n    \n  \n  \n    \n      462\n      Examples\n    \n    \n      104.7M\n      Reasoning Chars\n    \n    \n      ≈26.35M\n      Est. 
Tokens\n    \n    \n      552K\n      Max Trace… See the full description on the dataset page: https://huggingface.co/datasets/HelioAI/Fable-5-Distill-Reasoning-462x.","downloads":1267,"tags":["task_categories:text-generation","annotations_creators:machine-generated","language:en","language:ru","license:unknown","size_categories:n<1K","region:us","reasoning","long-context","reasoning-traces","synthetic-data","chain-of-thought","process-supervision","mythos-v2","deep-reasoning","trace-analysis","sft","ai-reasoning","cybersecurity","biomedicine","full-weights","unrestricted-reasoning"],"createdAt":"2026-05-31T22:58:09.000Z","key":""},{"_id":"6a1df6c61a986dda4c007af9","id":"nvidia/Nemotron-3.5-Content-Safety-Dataset","author":"nvidia","disabled":false,"gated":false,"lastModified":"2026-06-05T15:26:25.000Z","likes":9,"trendingScore":4,"private":false,"sha":"841f5023b8db12b484180c6121bb10fcf400d1c3","description":"\n\t\n\t\t\n\t\n\t\n\t\tNemotron 3.5 Content Safety Dataset\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Description:\n\t\n\nNemotron 3.5 Content Safety Dataset is a hybrid real/synthetic supervised instruction dataset for content-safety classification of human and assistant interactions. The dataset contains text-only and image-grounded single-turn conversations. Each example asks a classifier to determine user safety, response safety, and harmful categories; a subset also covers topic-following classification. 
Some training… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-3.5-Content-Safety-Dataset.","downloads":647,"tags":["size_categories:10K<n<100K","format:parquet","format:optimized-parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-06-01T21:16:54.000Z","key":""},{"_id":"6a22fea456a6f32b21c1a5e7","id":"11-47/claude_opus_4.8_distill_5k","author":"11-47","disabled":false,"gated":false,"lastModified":"2026-06-05T16:52:03.000Z","likes":16,"trendingScore":4,"private":false,"sha":"ced1aa1f73cc5254fc27a56d298d9361296f7497","downloads":704,"tags":["license:apache-2.0","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-06-05T16:51:48.000Z","key":""},{"_id":"6a300991337974ac37c820cf","id":"build-small-hackathon/CVE_Vulnerailities_Detailed","author":"build-small-hackathon","disabled":false,"gated":false,"lastModified":"2026-06-15T14:18:13.000Z","likes":11,"trendingScore":4,"private":false,"sha":"728192687bda93fcc4eba79d20c5216d95f1770a","downloads":307,"tags":["size_categories:10K<n<100K","format:parquet","format:optimized-parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-06-15T14:17:53.000Z","key":""},{"_id":"6a30ca2a8137fb18ce4932ce","id":"BAAI-Agents/SWITCH","author":"BAAI-Agents","disabled":false,"gated":false,"lastModified":"2026-06-18T07:19:48.000Z","likes":4,"trendingScore":4,"private":false,"sha":"62d7ebfd0bcf189596412d45d74e1195154dfc32","description":"\n\t\n\t\t\n\t\n\t\n\t\tSWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in Long-horizon Embodied Scenarios\n\t\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Summary\n\t\n\nSWITCH (Semantic World Interface Tasks for Control & Handling) is a multimodal embodied-interaction benchmark for understanding, 
modeling, and evaluating actions over Tangible Control Interfaces (TCIs) in egocentric real-world scenarios.\nTCIs include everyday interfaces such as appliance panels, lighting controls, office machines, bathroom devices… See the full description on the dataset page: https://huggingface.co/datasets/BAAI-Agents/SWITCH.","downloads":1688,"tags":["task_categories:visual-question-answering","task_categories:text-generation","task_categories:video-to-video","task_categories:image-to-text","language:en","license:cc-by-nc-4.0","size_categories:1K<n<10K","format:json","modality:image","modality:text","modality:video","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","image","video","datasets","mlcroissant","embodied-ai","multimodal","egocentric","human-computer-interaction","world-modeling","tangible-control-interfaces","benchmark"],"createdAt":"2026-06-16T03:59:38.000Z","key":""},{"_id":"6a3154f1671ba44c169ee371","id":"google/WikiProfile","author":"google","disabled":false,"gated":false,"lastModified":"2026-06-19T09:58:02.000Z","likes":12,"trendingScore":4,"private":false,"sha":"0448b6abf3d6fb0a964e6935c0265dbd47584cda","description":"\n\t\n\t\t\n\t\n\t\n\t\tWikiProfile\n\t\n\nWikiProfile is a factual knowledge benchmark for evaluating how well language models encode and recall factual knowledge. 
It comprises 2,150 facts, each paired with 10 questions, for a total of 21,500 question instances.\nEach fact is grounded in the first paragraph (summary) of an English Wikipedia page and is defined as a proposition between two entities, a subject and an object (e.g., \"Oasis played their first gig at the Boardwalk club\" → subject: Oasis, object:… See the full description on the dataset page: https://huggingface.co/datasets/google/WikiProfile.","downloads":289,"tags":["task_categories:question-answering","language:en","license:cc-by-sa-4.0","size_categories:1K<n<10K","format:csv","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2602.14080","region:us","Factuality","Knowledge","Wikipedia","QA"],"createdAt":"2026-06-16T13:51:45.000Z","key":""},{"_id":"6a3163ddae94378f6df34aa0","id":"lodestones/booru-essence","author":"lodestones","disabled":false,"gated":false,"lastModified":"2026-06-18T01:01:43.000Z","likes":7,"trendingScore":4,"private":false,"sha":"61741056949df208159b64eff4fe72892f135c5a","description":"\n\t\n\t\t\n\t\n\t\n\t\tBooru Essence\n\t\n\nA highly condensed, diversity-maximized image dataset distilled from booru-style image boards.\nthe images can be obtained here Clybius/booru-essence-images\n\n\t\n\t\t\n\t\n\t\n\t\tOverview\n\t\n\nBooru Essence is a compact yet comprehensive dataset of ~41,000 images, curated through a maximum variance selection strategy. 
Rather than collecting images in bulk, each image was chosen to contribute unique tag coverage — ensuring that all 74,000+ booru tags have at least one… See the full description on the dataset page: https://huggingface.co/datasets/lodestones/booru-essence.","downloads":162,"tags":["license:cc-by-nc-sa-4.0","size_categories:10K<n<100K","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-06-16T14:55:25.000Z","key":""},{"_id":"6a3bf903aac4154d09fc4c3b","id":"Voxel51/KITScenes-LongTail","author":"Voxel51","disabled":false,"gated":false,"lastModified":"2026-06-24T18:30:43.000Z","likes":4,"trendingScore":4,"private":false,"sha":"8e983bee36ebbc26c6d53655a7cf1ebeb8873a3c","description":"\n\t\n\t\t\n\t\n\t\n\t\tDataset Card for KITScenes-LongTail Videos\n\t\n\n\nA FiftyOne grouped video dataset built from KIT-MRT/KITScenes-LongTail, a long-tail autonomous-driving benchmark for VLMs/VLAs. \nEach scenario is a ~9 s, six-camera surround-view event with a high-level driving instruction, multiple candidate future trajectories, and multilingual expert reasoning traces.\nThis is a FiftyOne dataset with 103 samples.\n\n\t\n\t\t\n\t\n\t\n\t\tInstallation\n\t\n\nIf you haven't already, install FiftyOne:\npip install -U… See the full description on the dataset page: 
https://huggingface.co/datasets/Voxel51/KITScenes-LongTail.","downloads":1196,"tags":["task_categories:video-classification","annotations_creators:expert-generated","language:en","language:es","language:zh","size_categories:n<1K","modality:3d","modality:video","library:datasets","library:mlcroissant","library:fiftyone","arxiv:2603.23607","region:us","fiftyone","group","video","autonomous-driving","end-to-end-driving","trajectory-prediction","motion-planning","reasoning","chain-of-thought","multimodal","multilingual","long-tail","embeddings","depth-estimation","3d-reconstruction","point-cloud"],"createdAt":"2026-06-24T15:34:27.000Z","key":""},{"_id":"6a3cbcb266f32f2d5c74aa07","id":"RicemanT/Anime-Background-Finetuning-V1.1","author":"RicemanT","disabled":false,"gated":false,"lastModified":"2026-06-26T03:35:03.000Z","likes":4,"trendingScore":4,"private":false,"sha":"d6bd50b339930d724b875ec18c61b2b3a945a3b3","description":"\n\t\n\t\t\n\t\n\t\n\t\tAnime-Background-Finetuning (10143 manually curated by hand images from danbooru and reddit collections)\n\t\n\nThe dataset contain roughly 2k of anime Screencap data and 8k of scrapped danbooru illustration data.\nThis is the proccessed version of the dataset meant to be used for my personal finetuning practice project, please visit my RicemanT/Background-Finetuning repo for the raw unprocessed data that you can process yourself.\nThe dataset have two minor type of processing being done… See the full description on the dataset page: 
https://huggingface.co/datasets/RicemanT/Anime-Background-Finetuning-V1.1.","downloads":5054,"tags":["language:en","license:apache-2.0","size_categories:10K<n<100K","modality:image","modality:text","region:us"],"createdAt":"2026-06-25T05:29:22.000Z","key":""},{"_id":"6a3ecb29a60d89da1921c75d","id":"yigitekin/BeyondMasks","author":"yigitekin","disabled":false,"gated":false,"lastModified":"2026-06-26T20:07:31.000Z","likes":4,"trendingScore":4,"private":false,"sha":"0774e43de9abc576e389c78a319b4a5ff3472fdf","description":"\n\t\n\t\t\n\t\n\t\n\t\tBeyondMasks\n\t\n\nDataset repository for BeyondMasks: Evaluating Causal and Physical Consistency in Video Object Removal, accepted to ECCV 2026.\nAuthors: Yiğit Ekin, Enes Sanli, Aykut Erdem, Erkut Erdem, Aysegul Dundar\nBeyondMasks evaluates whether video object removal methods remove not only the target object, but also its causal and physical aftereffects such as shadows, reflections, and light effects.\n\n\t\n\t\t\n\t\n\t\n\t\tRepository Contents\n\t\n\nBeyondMasks/\n├── masks/                #… See the full description on the dataset page: https://huggingface.co/datasets/yigitekin/BeyondMasks.","downloads":333,"tags":["license:cc-by-4.0","region:us"],"createdAt":"2026-06-26T18:55:37.000Z","key":""},{"_id":"6a413a341831ca8ca806cd45","id":"prathoshap/vagdhenu-data","author":"prathoshap","disabled":false,"gated":false,"lastModified":"2026-06-28T19:16:49.000Z","likes":4,"trendingScore":4,"private":false,"sha":"adda7747c1ea05c45e6b789d23d6b6bbf550918d","description":"\n\t\n\t\t\n\t\n\t\n\t\tVāgdhenu — Sanskrit Chant Corpus\n\t\n\nA single-speaker Sanskrit chant (pārāyaṇa) recording corpus — classical ślokas chanted with tradition-faithful prosody and metrically-aware durations. Training data behind the Vāgdhenu Sanskrit Chant TTS.\n~1,467 clips · ~5.3 hours · 24 kHz mono. One reciter (the author); classical śāstra chant, no Vedic svaras. 
Two configs (two cutting/normalization passes over the recording sessions, largely different verses):\n\n\t\n\t\t\nconfig\nclips\nhours\nsources… See the full description on the dataset page: https://huggingface.co/datasets/prathoshap/vagdhenu-data.","downloads":5,"tags":["task_categories:text-to-speech","language:sa","license:cc-by-4.0","size_categories:1K<n<10K","format:audiofolder","modality:audio","modality:text","library:datasets","library:mlcroissant","region:us"],"createdAt":"2026-06-28T15:13:56.000Z","key":""},{"_id":"6a42c1990b85d2302a07c74a","id":"Voxel51/SceneFun3D","author":"Voxel51","disabled":false,"gated":false,"lastModified":"2026-06-29T19:44:43.000Z","likes":4,"trendingScore":4,"private":false,"sha":"76803371aae67277cfa0cc29804db2368c6756ce","description":"\n\t\n\t\t\n\t\n\t\n\t\tDataset Card for SceneFun3D\n\t\n\n\nSceneFun3D is a 3D scene-understanding dataset of high-resolution Faro laser-scan point clouds of indoor environments, densely annotated with fine-grained\nfunctional interactive elements (handles, knobs, buttons, switches, ...), their affordances, motion parameters, and free-form task descriptions.\nEach scene is also captured by several iPad video sequences with RGB, depth, camera poses, and intrinsics.\nThis is the FiftyOne version of the dataset: a… See the full description on the dataset page: 
https://huggingface.co/datasets/Voxel51/SceneFun3D.","downloads":0,"tags":["task_categories:object-detection","annotations_creators:expert-generated","annotations_creators:machine-generated","language:en","license:cc-by-nc-sa-4.0","size_categories:n<1K","modality:video","modality:3d","library:fiftyone","region:us","fiftyone","3d","point-cloud","fo3d","group","video","rgbd","depth","affordance","functionality","indoor-scenes","robotics"],"createdAt":"2026-06-29T19:03:53.000Z","key":""},{"_id":"621ffdd236468d709f181fce","id":"karpathy/tiny_shakespeare","author":"karpathy","disabled":false,"gated":false,"lastModified":"2024-01-18T11:17:14.000Z","likes":85,"trendingScore":3,"private":false,"sha":"c7a7ff3e41cda4f190ec575d180e764eb7f7f4ba","citation":"@misc{\n  author={Karpathy, Andrej},\n  title={char-rnn},\n  year={2015},\n  howpublished={\\\\url{https://github.com/karpathy/char-rnn}}\n}","description":"40,000 lines of Shakespeare from a variety of Shakespeare's plays. Featured in Andrej Karpathy's blog post 'The Unreasonable Effectiveness of Recurrent Neural Networks': http://karpathy.github.io/2015/05/21/rnn-effectiveness/.\n\nTo use for e.g. 
character modelling:\n\n```\nd = datasets.load_dataset(name='tiny_shakespeare')['train']\nd = d.map(lambda x: datasets.Value('strings').unicode_split(x['text'], 'UTF-8'))\n# train split includes vocabulary for other splits\nvocabulary = sorted(set(next(iter(d)).numpy()))\nd = d.map(lambda x: {'cur_char': x[:-1], 'next_char': x[1:]})\nd = d.unbatch()\nseq_len = 100\nbatch_size = 2\nd = d.batch(seq_len)\nd = d.batch(batch_size)\n```","downloads":6440,"tags":["region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f18200d","id":"Salesforce/wikitext","author":"Salesforce","disabled":false,"gated":false,"lastModified":"2024-01-04T16:49:18.000Z","likes":727,"trendingScore":3,"private":false,"sha":"b08601e04326c79dfdd32d625aee71d232d685c3","description":"\n\t\n\t\t\n\t\tDataset Card for \"wikitext\"\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\n The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified\n Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.\nCompared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over\n110 times larger. 
The WikiText dataset also features a far larger… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/wikitext.","downloads":1317628,"paperswithcode_id":"wikitext-2","tags":["task_categories:text-generation","task_categories:fill-mask","task_ids:language-modeling","task_ids:masked-language-modeling","annotations_creators:no-annotation","language_creators:crowdsourced","multilinguality:monolingual","source_datasets:original","language:en","license:cc-by-sa-3.0","license:gfdl","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:1609.07843","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f183929","id":"codeparrot/github-code","author":"codeparrot","disabled":false,"gated":false,"lastModified":"2022-10-20T15:01:14.000Z","likes":367,"trendingScore":3,"private":false,"sha":"b5661e6b17396364b2bcf8e68977b0d28e1ebd19","description":"The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totalling 1TB of text data. 
The dataset was created from the GitHub dataset on BigQuery.","downloads":1210198,"tags":["task_categories:text-generation","task_ids:language-modeling","language_creators:crowdsourced","language_creators:expert-generated","multilinguality:multilingual","language:code","license:other","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f184284","id":"wikimedia/wikipedia","author":"wikimedia","disabled":false,"gated":false,"lastModified":"2024-01-09T09:40:51.000Z","likes":1256,"trendingScore":3,"private":false,"sha":"b04c8d1ceb2f5cd4588862100d08de323dccfbaa","description":"\n\t\n\t\t\n\t\tDataset Card for Wikimedia Wikipedia\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nWikipedia dataset containing cleaned articles of all languages.\nThe dataset is built from the Wikipedia dumps (https://dumps.wikimedia.org/)\nwith one subset per language, each containing a single train split.\nEach example contains the content of one full Wikipedia article with cleaning to strip\nmarkdown and unwanted sections (references, etc.).\nAll language subsets have already been processed for recent dump, and you… See the full description on the dataset page: 
https://huggingface.co/datasets/wikimedia/wikipedia.","downloads":174513,"tags":["task_categories:text-generation","task_categories:fill-mask","task_ids:language-modeling","task_ids:masked-language-modeling","language:ab","language:ace","language:ady","language:af","language:alt","language:am","language:ami","language:an","language:ang","language:anp","language:ar","language:arc","language:ary","language:arz","language:as","language:ast","language:atj","language:av","language:avk","language:awa","language:ay","language:az","language:azb","language:ba","language:ban","language:bar","language:bbc","language:bcl","language:be","language:bg","language:bh","language:bi","language:bjn","language:blk","language:bm","language:bn","language:bo","language:bpy","language:br","language:bs","language:bug","language:bxr","language:ca","language:cbk","language:cdo","language:ce","language:ceb","language:ch","language:chr","language:chy","language:ckb","language:co","language:cr","language:crh","language:cs","language:csb","language:cu","language:cv","language:cy","language:da","language:dag","language:de","language:dga","language:din","language:diq","language:dsb","language:dty","language:dv","language:dz","language:ee","language:el","language:eml","language:en","language:eo","language:es","language:et","language:eu","language:ext","language:fa","language:fat","language:ff","language:fi","language:fj","language:fo","language:fon","language:fr","language:frp","language:frr","language:fur","language:fy","language:ga","language:gag","language:gan","language:gcr","language:gd","language:gl","language:glk","language:gn","language:gom","language:gor","language:got","language:gpe","language:gsw","language:gu","language:guc","language:gur","language:guw","language:gv","language:ha","language:hak","language:haw","language:hbs","language:he","language:hi","language:hif","language:hr","language:hsb","language:ht","language:hu","language:hy","language:hyw","language:ia","language:id","languag
e:ie","language:ig","language:ik","language:ilo","language:inh","language:io","language:is","language:it","language:iu","language:ja","language:jam","language:jbo","language:jv","language:ka","language:kaa","language:kab","language:kbd","language:kbp","language:kcg","language:kg","language:ki","language:kk","language:kl","language:km","language:kn","language:ko","language:koi","language:krc","language:ks","language:ksh","language:ku","language:kv","language:kw","language:ky","language:la","language:lad","language:lb","language:lbe","language:lez","language:lfn","language:lg","language:li","language:lij","language:lld","language:lmo","language:ln","language:lo","language:lt","language:ltg","language:lv","language:lzh","language:mad","language:mai","language:map","language:mdf","language:mg","language:mhr","language:mi","language:min","language:mk","language:ml","language:mn","language:mni","language:mnw","language:mr","language:mrj","language:ms","language:mt","language:mwl","language:my","language:myv","language:mzn","language:nah","language:nan","language:nap","language:nds","language:ne","language:new","language:nia","language:nl","language:nn","language:no","language:nov","language:nqo","language:nrf","language:nso","language:nv","language:ny","language:oc","language:olo","language:om","language:or","language:os","language:pa","language:pag","language:pam","language:pap","language:pcd","language:pcm","language:pdc","language:pfl","language:pi","language:pih","language:pl","language:pms","language:pnb","language:pnt","language:ps","language:pt","language:pwn","language:qu","language:rm","language:rmy","language:rn","language:ro","language:ru","language:rue","language:rup","language:rw","language:sa","language:sah","language:sat","language:sc","language:scn","language:sco","language:sd","language:se","language:sg","language:sgs","language:shi","language:shn","language:si","language:sk","language:skr","language:sl","language:sm","language:smn","language:sn","languag
e:so","language:sq","language:sr","language:srn","language:ss","language:st","language:stq","language:su","language:sv","language:sw","language:szl","language:szy","language:ta","language:tay","language:tcy","language:te","language:tet","language:tg","language:th","language:ti","language:tk","language:tl","language:tly","language:tn","language:to","language:tpi","language:tr","language:trv","language:ts","language:tt","language:tum","language:tw","language:ty","language:tyv","language:udm","language:ug","language:uk","language:ur","language:uz","language:ve","language:vec","language:vep","language:vi","language:vls","language:vo","language:vro","language:wa","language:war","language:wo","language:wuu","language:xal","language:xh","language:xmf","language:yi","language:yo","language:yue","language:za","language:zea","language:zgh","language:zh","language:zu","license:cc-by-sa-3.0","license:gfdl","size_categories:10M<n<100M","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"627007d3becab9e2dcf15a40","id":"ILSVRC/imagenet-1k","author":"ILSVRC","disabled":false,"gated":"auto","lastModified":"2025-09-17T04:58:55.000Z","likes":844,"trendingScore":3,"private":false,"sha":"49e2ee26f3810fb5a7536bbf732a7b07389a47b5","description":"\n\t\n\t\t\n\t\tDataset Card for ImageNet\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nILSVRC 2012, commonly known as 'ImageNet' is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a \"synonym set\" or \"synset\". There are more than 100,000 synsets in WordNet, majority of them are nouns (80,000+). ImageNet aims to provide on average 1000 images to illustrate each synset. 
Images of each concept are… See the full description on the dataset page: https://huggingface.co/datasets/ILSVRC/imagenet-1k.","downloads":105519,"paperswithcode_id":"imagenet-1k-1","tags":["task_categories:image-classification","task_ids:multi-class-image-classification","annotations_creators:crowdsourced","language_creators:crowdsourced","multilinguality:monolingual","source_datasets:original","language:en","license:other","size_categories:1M<n<10M","format:parquet","format:optimized-parquet","modality:image","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:1409.0575","arxiv:1912.07726","arxiv:1811.12231","arxiv:2109.13228","region:us"],"createdAt":"2022-05-02T16:33:23.000Z","key":""},{"_id":"62b4794beb8d44abbfeedc8f","id":"CShorten/ML-ArXiv-Papers","author":"CShorten","disabled":false,"gated":false,"lastModified":"2022-06-27T12:15:11.000Z","likes":68,"trendingScore":3,"private":false,"sha":"c878972daa0a5ec5f0d684354b6c8018f27d1316","description":"This dataset contains the subset of ArXiv papers with the \"cs.LG\" tag to indicate the paper is about Machine Learning.\nThe core dataset is filtered from the full ArXiv dataset hosted on Kaggle: https://www.kaggle.com/datasets/Cornell-University/arxiv. The original dataset contains roughly 2 million papers. 
This dataset contains roughly 100,000 papers following the category filtering.\nThe dataset is maintained by with requests to the ArXiv API.\nThe current iteration of the dataset only contains… See the full description on the dataset page: https://huggingface.co/datasets/CShorten/ML-ArXiv-Papers.","downloads":4329,"tags":["license:afl-3.0","size_categories:100K<n<1M","format:csv","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-06-23T14:31:39.000Z","key":""},{"_id":"641debae1d05404efd046a4f","id":"yahma/alpaca-cleaned","author":"yahma","disabled":false,"gated":false,"lastModified":"2023-04-10T20:29:06.000Z","likes":847,"trendingScore":3,"private":false,"sha":"12567cabf869d7c92e573c7c783905fc160e9639","description":"\n\t\n\t\t\n\t\tDataset Card for Alpaca-Cleaned\n\t\n\n\nRepository: https://github.com/gururise/AlpacaDataCleaned\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\nThis is a cleaned version of the original Alpaca Dataset released by Stanford. 
The following issues have been identified in the original release and fixed in this dataset:\n\nHallucinations: Many instructions in the original dataset had instructions referencing data on the internet, which just caused GPT3 to hallucinate an answer.\n\n\"instruction\":\"Summarize the… See the full description on the dataset page: https://huggingface.co/datasets/yahma/alpaca-cleaned.","downloads":24809,"tags":["task_categories:text-generation","language:en","license:cc-by-4.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","instruction-finetuning"],"createdAt":"2023-03-24T18:27:58.000Z","key":""},{"_id":"64b39b9aee7a5f1825fdf662","id":"nampdn-ai/tiny-codes","author":"nampdn-ai","disabled":false,"gated":"auto","lastModified":"2023-09-30T04:14:36.000Z","likes":292,"trendingScore":3,"private":false,"sha":"9aebe5ee8b406356d5f5f2d603bc0a1684ee8ce7","description":"\n\t\n\t\t\n\t\tReasoning with Language and Code\n\t\n\nThis synthetic dataset is a collection of 1.6 millions short and clear code snippets that can help LLM models learn how to reason with both natural and programming languages. The dataset covers a wide range of programming languages, such as Python, TypeScript, JavaScript, Ruby, Julia, Rust, C++, Bash, Java, C#, and Go. 
It also includes two database languages: Cypher (for graph databases) and SQL (for relational databases) in order to study the… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/tiny-codes.","downloads":1195,"tags":["task_categories:text-generation","language:en","license:mit","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2306.11644","arxiv:2305.07759","doi:10.57967/hf/0937","region:us"],"createdAt":"2023-07-16T07:26:18.000Z","key":""},{"_id":"650a9248d26103b6eee3ea7b","id":"lmsys/lmsys-chat-1m","author":"lmsys","disabled":false,"gated":"auto","lastModified":"2024-07-27T09:28:42.000Z","likes":925,"trendingScore":3,"private":false,"sha":"200748d9d3cddcc9d782887541057aca0b18c5da","description":"\n\t\n\t\t\n\t\tLMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset\n\t\n\nThis dataset contains one million real-world conversations with 25 state-of-the-art LLMs.\nIt is collected from 210K unique IP addresses in the wild on the Vicuna demo and Chatbot Arena website from April to August 2023.\nEach sample includes a conversation ID, model name, conversation text in OpenAI API JSON format, detected language tag, and OpenAI moderation API tag.\nUser consent is obtained through the \"Terms of use\"… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/lmsys-chat-1m.","downloads":6904,"tags":["size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2309.11998","region:us"],"createdAt":"2023-09-20T06:33:44.000Z","key":""},{"_id":"6611aa7e66d71ef25e7355a7","id":"espnet/yodas2","author":"espnet","disabled":false,"gated":false,"lastModified":"2025-05-15T22:28:55.000Z","likes":56,"trendingScore":3,"private":false,"sha":"c9674490249665d658f527e2684848377108d82c","description":"YODAS2 is the long-form dataset from YODAS 
dataset.\nIt provides the same dataset as espnet/yodas  but YODAS2 has the following new features:\n\nformatted in the long-form (video-level) where audios are not segmented.\naudios are encoded using higher sampling rates (i.e. 24k)\n\nFor detailed information about YODAS dataset, please refer to our paper and the espnet/yodas repo.\n\n\t\n\t\t\n\t\n\t\n\t\tUsage:\n\t\n\nEach data point corresponds to an entire video on YouTube, it contains the following fields:\n\nvideo_id:… See the full description on the dataset page: https://huggingface.co/datasets/espnet/yodas2.","downloads":60318,"tags":["license:cc-by-3.0","arxiv:2406.00899","region:us"],"createdAt":"2024-04-06T20:03:10.000Z","key":""},{"_id":"663b7fd5a4152b77b637ba11","id":"TIGER-Lab/MMLU-Pro","author":"TIGER-Lab","disabled":false,"gated":false,"lastModified":"2026-05-02T06:26:05.000Z","likes":490,"trendingScore":3,"private":false,"sha":"b189ec765aa7ed75c8acfea42df31fdae71f97be","description":"\n\t\n\t\t\n\t\tMMLU-Pro Dataset\n\t\n\nMMLU-Pro dataset is a more robust and challenging massive multi-task understanding dataset tailored to more rigorously benchmark large language models' capabilities. This dataset contains 12K complex questions across various disciplines. 
\n|Github | 🏆Leaderboard | 📖Paper |\n\n\t\n\t\t\n\t\n\t\n\t\t🚀 What's New\n\t\n\n\n[2026.03.11] Added more cutting-edge frontier models to the leaderboard, including the Claude-4.6 series, Seed2.0 series, Qwen3.5 series, and Gemini-3.1-Pro, among… See the full description on the dataset page: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro.","downloads":156460,"tags":["benchmark:official","benchmark:eval-yaml","task_categories:question-answering","language:en","license:mit","size_categories:10K<n<100K","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2406.01574","doi:10.57967/hf/2439","region:us","evaluation"],"createdAt":"2024-05-08T13:36:21.000Z","key":""},{"_id":"6660dbf108494e3899d4acb0","id":"galileo-ai/ragbench","author":"galileo-ai","disabled":false,"gated":false,"lastModified":"2024-06-11T22:05:30.000Z","likes":118,"trendingScore":3,"private":false,"sha":"97808f3e5fd16ede40bbff6c2949af8139b2eb7b","description":"\n\t\n\t\t\n\t\tRAGBench\n\t\n\n\n\t\n\t\t\n\t\tDataset Overview\n\t\n\nRAGBEnch is a large-scale RAG benchmark dataset of 100k RAG examples.\nIt covers five unique industry-specific domains and various RAG task types.\nRAGBench examples are sourced from industry corpora such as user manuals, making it particularly relevant for industry applications.\nRAGBench comrises 12 sub-component datasets, each one split into train/validation/test splits\n\n\t\n\t\t\n\t\tUsage\n\t\n\nfrom datasets import load_dataset\n\n# load… See the full description on the dataset page: 
https://huggingface.co/datasets/galileo-ai/ragbench.","downloads":6205,"tags":["license:cc-by-4.0","size_categories:10K<n<100K","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-06-05T21:43:13.000Z","key":""},{"_id":"66677de8998c6a8b60bc66c8","id":"justinwangx/CTFtime","author":"justinwangx","disabled":false,"gated":false,"lastModified":"2024-06-12T19:53:06.000Z","likes":5,"trendingScore":3,"private":false,"sha":"041a7744e2716c6a1c59f7bc66a23c32973b1d82","downloads":135,"tags":["size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-06-10T22:27:52.000Z","key":""},{"_id":"66878327793ca225e0cb60f2","id":"naijavoices/naijavoices-dataset","author":"naijavoices","disabled":false,"gated":"auto","lastModified":"2025-08-07T00:31:48.000Z","likes":26,"trendingScore":3,"private":false,"sha":"99453545216e72b090cdf48a9fc3d6269bcd7967","description":"\nImportant Information: please be aware that this version of the dataset is huge (500+ GB) and can therefore be challenging to use. To alleviate this and facilitate adoption, we’ve provided a compressed version (84GB) here. We suggest using that instead if you have compute/storage constraints. They are both the exact same data.\n\n\n\t\n\t\t\n\t\tIntroduction\n\t\n\nWelcome to the NaijaVoices dataset. The NaijaVoices dataset consists of 1,800 hours of authentic speech (from over 5,000 diverse speakers!) 
and… See the full description on the dataset page: https://huggingface.co/datasets/naijavoices/naijavoices-dataset.","downloads":1264,"tags":["license:cc-by-nc-sa-4.0","size_categories:1M<n<10M","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2505.20564","doi:10.57967/hf/3257","region:us"],"createdAt":"2024-07-05T05:22:47.000Z","key":""},{"_id":"66cf1ca0103ac4798609f32a","id":"CaptionEmporium/flickr-megalith-10m-internvl2-multi-caption","author":"CaptionEmporium","disabled":false,"gated":false,"lastModified":"2024-08-28T12:54:11.000Z","likes":31,"trendingScore":3,"private":false,"sha":"75b33ce72533023bf907f8a0bf160099883f1bae","description":"\n\t\n\t\t\n\t\tDataset Card for flickr-megalith-10m-internvl2-multi-caption\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThis is approximately 57.3 million synthetic captions for the images found in madebyollin/megalith-10m. \nIt includes the following captions:\n\nInternVL2 8B long captions (by CaptionEmporium)\nInternVL2 8B short captions (by CaptionEmporium)\nFlorence2 long captions (by aipicasso)\nFlorence2 short captions (by CaptionEmporium)\nShareCaptioner long captions (by drawthingsai)\nShareCaptioner short… See the full description on the dataset page: 
https://huggingface.co/datasets/CaptionEmporium/flickr-megalith-10m-internvl2-multi-caption.","downloads":272,"tags":["task_categories:text-to-image","task_categories:image-to-text","task_categories:other","language:en","license:cc-by-sa-4.0","size_categories:1M<n<10M","format:parquet","modality:image","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","image-text-dataset","synthetic-dataset","InternVL2","InternVL2-8b","synthetic-captions","flickr","megalith"],"createdAt":"2024-08-28T12:48:32.000Z","key":""},{"_id":"66ec12a23843d7d5b57a71c5","id":"Lichess/chess-puzzles","author":"Lichess","disabled":false,"gated":false,"lastModified":"2026-06-11T16:11:32.000Z","likes":32,"trendingScore":3,"private":false,"sha":"4305c92e026ecccde805f6b625d8354b4650f834","description":"\n\t\n\t\t\n\t\n\t\n\t\tDataset Card for Lichess Puzzles\n\t\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Description\n\t\n\n6,014,381 puzzles, rated and tagged. See them in action on Lichess. \nThis dataset is updated monthly, and was last updated on Jun 11th, 2026.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Creation\n\t\n\nGenerating the initial dataset chess puzzles took more than 50 years of CPU time. 
We went through 300,000,000 analyzed games from the Lichess database, and re-analyzed interesting positions with Stockfish 12/13/14/15 NNUE at 40 meganodes.… See the full description on the dataset page: https://huggingface.co/datasets/Lichess/chess-puzzles.","downloads":4189,"tags":["license:cc0-1.0","size_categories:1M<n<10M","format:parquet","format:optimized-parquet","modality:tabular","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us","chess","lichess","puzzles"],"createdAt":"2024-09-19T12:01:38.000Z","key":""},{"_id":"6714c4993edd22e448010822","id":"laion/LAION-DISCO-12M","author":"laion","disabled":false,"gated":false,"lastModified":"2026-06-21T14:16:49.000Z","likes":51,"trendingScore":3,"private":false,"sha":"6e7bf3758a77301e46a715af894fefd79bb1da53","description":"The LAION-DISCO-12M dataset contains 12M links to music on YouTube, inspired by the methodology of DISCO-10M. It contains song metadata (song_id, title, artist_names, artist_ids, album_name, album_id, isExplicit, views, duration) and YouTube URL, pointing to the original song on the public web. 
It does not contain any original audio samples and is thus an index dataset.\nStarting from an initial seed list of artists, we can discover new artists by recursively exploring the artists listed in the… See the full description on the dataset page: https://huggingface.co/datasets/laion/LAION-DISCO-12M.","downloads":440,"tags":["license:apache-2.0","size_categories:10M<n<100M","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2306.13512","region:us","music"],"createdAt":"2024-10-20T08:51:37.000Z","key":""},{"_id":"674d19e2548e472d0e95089b","id":"opendatalab/OmniDocBench","author":"opendatalab","disabled":false,"gated":false,"lastModified":"2026-06-26T10:19:31.000Z","likes":94,"trendingScore":3,"private":false,"sha":"aa1ee96d106dbe53d0ae59474d75c6e6d9b53fec","description":"\n\t\n\t\t\n\t\n\t\n\t\tOmniDocBench\n\t\n\nEnglish | 简体中文\nOmniDocBench is an evaluation dataset for diverse document parsing in real-world scenarios, with the following characteristics:\n\nDiverse Document Types: The evaluation set contains 1651 PDF pages, covering 10 document types, 5 layout types and 5 language types. 
Coverage includes academic literature, research and financial reports, newspapers, textbooks, exam papers, magazines, handwritten notes, historical documents, and more.\nRich Annotations:… See the full description on the dataset page: https://huggingface.co/datasets/opendatalab/OmniDocBench.","downloads":17890,"tags":["size_categories:1K<n<10K","format:imagefolder","modality:image","library:datasets","library:mlcroissant","arxiv:2412.07626","region:us"],"createdAt":"2024-12-02T02:22:26.000Z","key":""},{"_id":"678078c4e8281a79c2ae382c","id":"nvidia/Aegis-AI-Content-Safety-Dataset-2.0","author":"nvidia","disabled":false,"gated":false,"lastModified":"2025-06-09T19:15:56.000Z","likes":98,"trendingScore":3,"private":false,"sha":"d86bb8bedff51d25ac834ab7838f1cc61acb7a2c","description":"\n\t\n\t\t\n\t\t🛡️ Nemotron Content Safety Dataset V2\n\t\n\n\n\nThe Nemotron Content Safety Dataset V2, formerly known as Aegis AI Content Safety Dataset 2.0, is comprised of 33,416 annotated interactions between humans and LLMs, split into 30,007 training samples, 1,445 validation samples,  and 1,964 test samples. This release is an extension of the previously published Nemotron Content Safety Dataset V1. 
\nTo curate the dataset, we use the HuggingFace version of human preference data about harmlessness… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0.","downloads":3763,"tags":["task_categories:text-classification","language:en","license:cc-by-4.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","safety","content moderation","LLM safety","toxicity detection","nemoguard","aegis","nemotron"],"createdAt":"2025-01-10T01:32:52.000Z","key":""},{"_id":"67b307928a1b0f0b48cd7cfe","id":"epfml/FineWeb2-HQ","author":"epfml","disabled":false,"gated":false,"lastModified":"2025-02-19T21:39:01.000Z","likes":71,"trendingScore":3,"private":false,"sha":"c0c06e94fd3a44ae9e802b2b0fc533817601eb5e","description":"\n\t\n\t\t\n\t\tFineWeb2-HQ\n\t\n\n\n\t\n\t\t\n\t\tDataset summary\n\t\n\nFineWeb2-HQ is a high-quality, model-filtered pretraining dataset derived as a subset of FineWeb2, spanning 20 languages. It enables around 6x faster pretraining compared to the base dataset. 
FineWeb2-HQ was created by selecting the top 10% quality documents of FineWeb2 in each language, based on scores assigned by a deep learning classifier trained to identify structured and knowledge-rich samples using XLM-RoBERTa embeddings.\n\n  \n\n\nValidation… See the full description on the dataset page: https://huggingface.co/datasets/epfml/FineWeb2-HQ.","downloads":20957,"tags":["task_categories:text-generation","language:ru","language:zh","language:de","language:ja","language:es","language:fr","language:it","language:pt","language:pl","language:nl","language:id","language:tr","language:cs","language:vi","language:sv","language:fa","language:ar","language:el","language:da","language:hu","license:odc-by","size_categories:100M<n<1B","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2502.10361","region:us"],"createdAt":"2025-02-17T09:55:30.000Z","key":""},{"_id":"67b6e7221a0bf9e8a70c385e","id":"m-a-p/SuperGPQA","author":"m-a-p","disabled":false,"gated":false,"lastModified":"2025-04-30T15:15:21.000Z","likes":92,"trendingScore":3,"private":false,"sha":"4430d4458112c7d4497fdcf94d7cc223313d6acf","description":"This repository contains the data presented in SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines.\n\n\t\n\t\t\n\t\tTutorials for submitting to the official leadboard\n\t\n\ncoming soon\n\n\t\n\t\t\n\t\t📜 License\n\t\n\nSuperGPQA is a composite dataset that includes both original content and portions of data derived from other sources. The dataset is made available under the Open Data Commons Attribution License (ODC-BY), which asserts no copyright over the underlying content. 
\nThis means that while the… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/SuperGPQA.","downloads":8031,"tags":["language:en","license:odc-by","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2502.14739","region:us"],"createdAt":"2025-02-20T08:26:10.000Z","key":""},{"_id":"67f62a2014693c9adb400fa5","id":"nvidia/OpenCodeInstruct","author":"nvidia","disabled":false,"gated":false,"lastModified":"2025-04-28T19:08:02.000Z","likes":101,"trendingScore":3,"private":false,"sha":"8f3ba5bafe4d6e8db46082cf7ae6741bc370604d","description":"\n\t\n\t\t\n\t\tOpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\nWe introduce OpenCodeInstruct, the largest open-access instruction tuning dataset, comprising 5 million diverse samples. OpenCodeInstruct is designed for supervised fine-tuning (SFT).\n\nTechnical Report - Discover the methodology and technical details behind OpenCodeInstruct.\nGithub Repo - Access the complete pipeline used to perform SFT.\n\nThis dataset is ready for… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/OpenCodeInstruct.","downloads":9712,"tags":["task_categories:text-generation","language:en","license:cc-by-4.0","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2504.04030","region:us","code","synthetic"],"createdAt":"2025-04-09T08:04:48.000Z","key":""},{"_id":"68072cc4cce05035af98207e","id":"nvidia/OpenMathReasoning","author":"nvidia","disabled":false,"gated":false,"lastModified":"2025-05-27T18:43:44.000Z","likes":467,"trendingScore":3,"private":false,"sha":"d3d08664755704f422af97d43a7ff0ded4bd95df","description":"\n\t\n\t\t\n\t\tOpenMathReasoning\n\t\n\nOpenMathReasoning is a large-scale math reasoning dataset for training large language 
models (LLMs). \nThis dataset contains \n\n306K unique mathematical problems sourced from AoPS forums with: \n3.2M long chain-of-thought (CoT) solutions\n1.7M long tool-integrated reasoning (TIR) solutions\n566K samples that select the most promising solution out of many candidates (GenSelect)\n\n\nAdditional 193K problems sourced from AoPS forums (problems only, no solutions)\n\nWe used… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/OpenMathReasoning.","downloads":18453,"tags":["task_categories:question-answering","task_categories:text-generation","language:en","license:cc-by-4.0","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2504.16891","region:us","math","nvidia"],"createdAt":"2025-04-22T05:44:36.000Z","key":""},{"_id":"6841fee647554eb6e0b7203d","id":"nvidia/PhysicalAI-Autonomous-Vehicles-NuRec","author":"nvidia","disabled":false,"gated":"auto","lastModified":"2026-06-25T21:22:11.000Z","likes":189,"trendingScore":3,"private":false,"sha":"9ef2e10b47312b6af9c88d9663e882dd2aa54341","description":"\n\t\n\t\t\n\t\n\t\n\t\ttask_categories:\n- robotics\ntags:\n- physicalAI\n\t\n\nFind the 1500+ scenes in the sample_set/26.04_release folder.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Description:\n\t\n\nNeural reconstructed dataset that carries 3D reconstructed driving scenes. The scenes are about 20 second long and stored in form of usdz files, along with respective xodr map files, surface mesh. 
The reconstructions were generated using 6 camera views (front-wide 120 deg, front-tele 30 deg, cross right/left 120 deg and rear right/left… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles-NuRec.","downloads":19489,"tags":["license:other","region:us"],"createdAt":"2025-06-05T20:32:38.000Z","key":""},{"_id":"68465f1ba516bd14fc146e1f","id":"nvidia/Nemotron-Personas-USA","author":"nvidia","disabled":false,"gated":false,"lastModified":"2025-12-16T19:13:23.000Z","likes":328,"trendingScore":3,"private":false,"sha":"5b4cd35ab46490c1da1bd2b5a2324d6f871be180","description":"\n\t\n\t\t\n\t\tNemotron-Personas-USA\n\t\n\n\n  \n  A compound AI approach to personas grounded in real-world distributions\n\n\n\n\t\n\t\t\n\t\tv1.1 Update\n\t\n\nThe v1.1 update introduces the following changes:\n\nleverage openai/gpt-oss-120b model instead of mistralai/Mixtral-8x22B-v0.1 model to improve data quality and diversity\nincrease the number of records from 100k to 1M, for a total of 0.94B tokens \nupdate the dataset name to Nemotron-Personas-USA in order to differentiate it from other region-specific datasets… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA.","downloads":12447,"tags":["task_categories:text-generation","language:en","license:cc-by-4.0","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","library:datadesigner","region:us","synthetic","personas","NVIDIA","datadesigner"],"createdAt":"2025-06-09T04:12:11.000Z","key":""},{"_id":"688279007069d9a83ab3a68b","id":"rajpurkarlab/ReXGroundingCT","author":"rajpurkarlab","disabled":false,"gated":"auto","lastModified":"2026-06-13T20:24:13.000Z","likes":42,"trendingScore":3,"private":false,"sha":"8e8227cf5271cb4b3821e513626bb05a4d5cd60a","description":"\n\t\n\t\t\n\t\n\t\n\t\tReXGroundingCT\n\t\n\nReXGroundingCT is a dataset designed 
to link free-text radiology findings with pixel-level segmentations in 3D chest CT scans. Each sample consists of a volumetric CT scan, associated segmentation masks for one or more findings, and detailed textual descriptions.The dataset has segmentations for 8,028 findings across 14 different categories in 3,142 CT scans. There are 2,992 scans allocated for training, 50 for public validation, and 100 held privately to be… See the full description on the dataset page: https://huggingface.co/datasets/rajpurkarlab/ReXGroundingCT.","downloads":2491,"tags":["license:cc-by-nc-sa-4.0","arxiv:2507.22030","arxiv:2403.17834","region:us"],"createdAt":"2025-07-24T18:18:40.000Z","key":""},{"_id":"68898207609ab65154f40ee8","id":"deepreinforce-ai/CUDA-L1","author":"deepreinforce-ai","disabled":false,"gated":false,"lastModified":"2025-08-12T22:58:42.000Z","likes":4,"trendingScore":3,"private":false,"sha":"b2dcc0a23737fc1c02fb4acba49d61b96d315ef7","description":"\n  \n      \n  \n\n\n\n\n\n\n     |     🏠  Project Page     |     📄  Paper     |     🔥  Demo\n\n\n\n\n\n  \n      \n\t\n\t\t\n\t\t🥳 Introduction\n\t\n\nIn this paper, we introduce CUDA-L1, an automated reinforcement learning (RL) framework for CUDA optimization. The core of CUDA-L1 is a contrastive RL model, a newly-designed RL system to enhance optimization through comparative learning. 
\n\n  \n      \n  \n\n \n    Fig：Average speedup across different architectures on KernelBench over baselines.\n\n\n\n\n\n\t\n\t\t\n\t\n\t\n\t\t🗒️ To-do… See the full description on the dataset page: https://huggingface.co/datasets/deepreinforce-ai/CUDA-L1.","downloads":258,"tags":["task_categories:text-generation","language:en","license:gpl-3.0","size_categories:1K<n<10K","arxiv:2507.14111","region:us"],"createdAt":"2025-07-30T02:23:03.000Z","key":""},{"_id":"688cefb03ba8a74369a84b4c","id":"amphion/Emilia-NV","author":"amphion","disabled":false,"gated":"auto","lastModified":"2025-09-18T06:19:01.000Z","likes":48,"trendingScore":3,"private":false,"sha":"cdd8b7a3f6ba142d6eea95a8e5ffe5b63034eb6e","description":"\n\t\n\t\t\n\t\tNVSpeech Dataset\n\t\n\n\n\t\n\t\t\n\t\tOverview\n\t\n\nThe NVSpeech dataset provides extensive annotations of paralinguistic vocalizations for Mandarin Chinese speech, aimed at enhancing the capabilities of automatic speech recognition (ASR) and text-to-speech (TTS) systems. 
The dataset features explicit word-level annotations for 18 categories of paralinguistic vocalizations, including non-verbal sounds like laughter and breathing, as well as lexicalized interjections like \"uhm\" and \"oh.\"… See the full description on the dataset page: https://huggingface.co/datasets/amphion/Emilia-NV.","downloads":560,"tags":["task_categories:text-to-speech","task_categories:automatic-speech-recognition","language:zh","license:cc-by-nc-sa-4.0","size_categories:100K<n<1M","format:webdataset","modality:audio","modality:text","library:datasets","library:webdataset","library:mlcroissant","region:us"],"createdAt":"2025-08-01T16:47:44.000Z","key":""},{"_id":"68da4fc8d9ddaa394ac167ef","id":"nick007x/arxiv-papers","author":"nick007x","disabled":false,"gated":false,"lastModified":"2026-04-01T05:09:46.000Z","likes":202,"trendingScore":3,"private":false,"sha":"daafd8c625b8f8ec6d5515a91392a989d2674dd2","downloads":16010,"tags":["size_categories:1M<n<10M","format:parquet","modality:document","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2025-09-29T09:22:16.000Z","key":""},{"_id":"6942c26a0c5ab8715b67e281","id":"bshada/open-schematics","author":"bshada","disabled":false,"gated":false,"lastModified":"2026-06-30T06:18:49.000Z","likes":166,"trendingScore":3,"private":false,"sha":"5c74c3648d22e73426bbeb4691397d9cc8053e34","description":"\n\n\t\n\t\t\n\t\n\t\n\t\tOpen Schematics Dataset\n\t\n\nThe largest dataset of electronic schematics and PCB layouts on the internet, built as an engineering reference for schematic and PCB layout work. 
It's a self-growing, autonomous dataset that continuously scans the web for new engineering designs and updates itself accordingly.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Description\n\t\n\nEach record corresponds to one schematic file and includes the raw source, rendered images, structured metadata, and all associated PCB files… See the full description on the dataset page: https://huggingface.co/datasets/bshada/open-schematics.","downloads":4028,"tags":["task_categories:text-generation","task_categories:image-to-text","task_categories:text-to-image","language:en","license:cc-by-4.0","size_categories:10K<n<100K","format:parquet","format:optimized-parquet","modality:image","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us","electronics","schematics","kicad","hardware","pcb","circuit-design","engineering"],"createdAt":"2025-12-17T14:47:06.000Z","key":""},{"_id":"695b5596af1ebac7b8a6b3d9","id":"Logics-MLLM/Logics-STEM-SFT-Dataset-Open-1.6M","author":"Logics-MLLM","disabled":false,"gated":false,"lastModified":"2026-01-19T02:18:04.000Z","likes":28,"trendingScore":3,"private":false,"sha":"0c7470d86e4f5fd91dd1fd89d885b00cce30fc03","description":"\n\t\n\t\t\n\t\tLogics-STEM-SFT-Dataset-2.2M\n\t\n\n\n\t\n\t\t\n\t\t📰 News\n\t\n\n\n[2026.01.05]🔥 Release of our Techinical Report.\n[2026.01.05]🔥 Release the first version of Logics-STEM-8B-SFT, Logics-STEM-8B-RL, /Logics-STEM-SFT-Dataset-Open-1.6M.\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tOverview\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tWhat is this dataset?\n\t\n\nLogics-STEM-SFT-Dataset-2.2M is a curated long Chain-of-Thought (CoT) SFT dataset for STEM reasoning, built on top of high-quality open-source data and enhanced through a rigorous curation and distillation… See the full description on the dataset page: 
https://huggingface.co/datasets/Logics-MLLM/Logics-STEM-SFT-Dataset-Open-1.6M.","downloads":729,"tags":["license:cc-by-nc-4.0","size_categories:1M<n<10M","format:json","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2601.01562","region:us"],"createdAt":"2026-01-05T06:09:26.000Z","key":""},{"_id":"6986cb617ee2b3c146bd2432","id":"openbmb/Ultra-FineWeb-L3","author":"openbmb","disabled":false,"gated":false,"lastModified":"2026-05-28T09:03:52.000Z","likes":305,"trendingScore":3,"private":false,"sha":"c68ab81ad03b2d2f476fa8ab3c72bed3528da359","description":"\n\t\n\t\t\n\t\tUltra-FineWeb-L3\n\t\n\n\n  \n\n\n\n📜 Ultra-FineWeb Technical Report |\n📦 UltraData Collection |\n🌐 UltraData | \n🤗 MiniCPM5 Series\n\n\n\nEnglish |\n中文\n\n\n\n\t\n\t\t\n\t\n\t\n\t\t📚 Introduction\n\t\n\nUltra-FineWeb-L3 is the L3 refined data for general high-quality web data within UltraData's L0-L4 tiered data management framework. Moving beyond L2 quality selection, it transforms high-value web corpora into structured, high-learnability training data with clearer reasoning signals and richer educational styles.… See the full description on the dataset page: 
https://huggingface.co/datasets/openbmb/Ultra-FineWeb-L3.","downloads":81804,"tags":["task_categories:text-generation","language:en","language:zh","license:apache-2.0","size_categories:1B<n<10B","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2505.05427","arxiv:2602.09003","region:us","llm","pretraining","data-synthesis","data-filtering","high-quality","general-knowledge","qa-generation","multi-style-rewriting","minicpm"],"createdAt":"2026-02-07T05:19:29.000Z","key":""},{"_id":"69aa384095321b78051efd3c","id":"nvidia/Nemotron-RL-Agentic-SWE-Pivot-v1","author":"nvidia","disabled":false,"gated":false,"lastModified":"2026-06-27T01:07:11.000Z","likes":12,"trendingScore":3,"private":false,"sha":"4947a3c8ea803413a65f9eca14a96ef521b2ddf5","description":"\n\t\n\t\t\n\t\n\t\n\t\tDataset Description:\n\t\n\nThe SWE-RL dataset provides GitHub issues for training and validating real-world software engineering agents using the OpenHands environment in NeMo Gym. The dataset is a refactored version of the SWE-Gym and R2E-Gym datasets to support the NeMo Gym input format.\nThis dataset is released as part of NVIDIA NeMo Gym, a framework for building reinforcement learning environments to train large language models. 
NeMo Gym contains a growing collection of training… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-RL-Agentic-SWE-Pivot-v1.","downloads":10379,"tags":["license:cc-by-4.0","size_categories:10K<n<100K","format:json","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2412.21139","arxiv:2504.07164","region:us"],"createdAt":"2026-03-06T02:13:20.000Z","key":""},{"_id":"69b1183046c6e7a964869ec4","id":"ropedia-ai/xperience-10m","author":"ropedia-ai","disabled":false,"gated":"manual","lastModified":"2026-04-21T05:03:45.000Z","likes":210,"trendingScore":3,"private":false,"sha":"ce943cf271a758b60240084892d05cf6dc12dd90","description":"\n\t\n\t\t\n\t\t⚠️ Important: If you have already submitted an access request but have not completed the required DocuSign agreement, your request will remain pending. Please complete signing and we will grant access once verified.\n\t\n\n\n  \n    \n  \n  \n  Interactive Intelligence from Human Xperience\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tXperience-10M\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Summary\n\t\n\nXperience-10M is a large-scale egocentric multimodal dataset of human experience for embodied AI, robotics, world models, and spatial… See the full description on the dataset page: 
https://huggingface.co/datasets/ropedia-ai/xperience-10m.","downloads":82127,"tags":["task_categories:video-classification","task_categories:image-to-text","task_categories:depth-estimation","task_categories:robotics","language:en","license:other","size_categories:1M<n<10M","modality:3d","modality:audio","modality:video","region:us","egocentric","first-person","multimodal","3d","4d","embodied-ai","robotics","human-motion","mocap","imu","audio","depth","captions","video"],"createdAt":"2026-03-11T07:22:24.000Z","key":""},{"_id":"69b38de28bcbe40d2d69828d","id":"nvidia/Nemotron-SFT-OpenCode-v1","author":"nvidia","disabled":false,"gated":false,"lastModified":"2026-03-23T23:32:38.000Z","likes":54,"trendingScore":3,"private":false,"sha":"556d5237acff203f3e1a0be49428634c3606cda2","description":"\n\t\n\t\t\n\t\tDataset Description:\n\t\n\nNemotron-SFT-OpenCode-v1 is an agentic instruction tuning dataset that enhances the ability of Large Language Models (LLMs) to operate within the OpenCode Command Line Interface (CLI) framework and instills simple capabilities such as tool calling and agent skills.\nThis dataset is ready for commercial/non-commercial use.\n\n\t\n\t\t\n\t\tDataset Subsets:\n\t\n\nNemotron-SFT-OpenCode-v1 contains the following subsets, where the questions and agent skills are synthetically… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-SFT-OpenCode-v1.","downloads":3503,"tags":["task_categories:text-generation","language:en","license:cc-by-4.0","size_categories:100K<n<1M","region:us","opencode"],"createdAt":"2026-03-13T04:09:06.000Z","key":""},{"_id":"69bb68398c6348fee0d81d16","id":"WhynotHug/DRSeg","author":"WhynotHug","disabled":false,"gated":false,"lastModified":"2026-06-23T03:23:15.000Z","likes":4,"trendingScore":3,"private":false,"sha":"2b143f9a0721b7b5dbba4dd9f0bed9a22ede444d","description":"\n\t\n\t\t\n\t\n\t\n\t\tDRSeg: UAV Reasoning Segmentation Benchmark\n\t\n\nDRSeg is the dataset introduced with 
PixDLM for UAV reasoning segmentation. It\ncontains high-resolution UAV images, instance masks, reasoning questions,\nreasoning answers, and reasoning-type annotations.\n\n\t\n\t\t\n\t\n\t\n\t\tRelease Status\n\t\n\n\nPublic release for the PixDLM CVPR 2026 Highlight work.\nThe repository includes lightweight JSONL metadata for the HuggingFace Dataset\nViewer and a full DRSeg archive for training and evaluation.\nThe 2027… See the full description on the dataset page: https://huggingface.co/datasets/WhynotHug/DRSeg.","downloads":122,"tags":["task_categories:image-segmentation","task_categories:visual-question-answering","language:en","license:cc-by-nc-4.0","size_categories:10K<n<100K","format:json","modality:image","modality:tabular","modality:text","modality:geospatial","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","uav","remote-sensing","reasoning-segmentation","cvpr-2026","pixdlm"],"createdAt":"2026-03-19T03:06:33.000Z","key":""},{"_id":"69c1614485328596fd4b0c9e","id":"liumindmind/NekoQA-30K","author":"liumindmind","disabled":false,"gated":false,"lastModified":"2026-05-24T03:39:08.000Z","likes":33,"trendingScore":3,"private":false,"sha":"5e2af5cd42f05a99fea70653dd4b9fc0a4dfbbf2","downloads":311,"tags":["license:apache-2.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-03-23T15:50:28.000Z","key":""},{"_id":"69c2419dda894d8e5757461f","id":"eddmpython/dartlab-data","author":"eddmpython","disabled":false,"gated":false,"lastModified":"2026-06-30T08:21:42.000Z","likes":6,"trendingScore":3,"private":false,"sha":"6101c16bee7c3db3d883e2ad08584f47475b547a","description":"\n\n\n\n\n\nDartLab Data\n\nStructured company data from DART & EDGAR disclosure filings\nDART 전자공시 + EDGAR 공시 데이터 — 한국 2,700사 / 미국 970사\n\n\n\n\n\n\n\n\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tWhat is this?\n\t\n\n\n\nPre-collected Parquet files from DartLab 
— a Python library that turns DART (Korea) and EDGAR (US) disclosure filings into one structured company map.\n한국 DART 전자공시 시스템과 미국 SEC EDGAR에서 수집한 기업 공시 데이터입니다.\nThis dataset is the data layer behind DartLab. When you run dartlab.Company(\"005930\"), the library automatically downloads the… See the full description on the dataset page: https://huggingface.co/datasets/eddmpython/dartlab-data.","downloads":62673,"tags":["task_categories:table-question-answering","task_categories:text-classification","language:ko","language:en","license:apache-2.0","size_categories:1K<n<10K","region:us","finance","disclosure","dart","edgar","sec","xbrl","korea","financial-statements","corporate-filings","전자공시","재무제표","사업보고서","한국"],"createdAt":"2026-03-24T07:47:41.000Z","key":""},{"_id":"69c3ec2776527e3029415249","id":"queyuecanyang/MIRACLE","author":"queyuecanyang","disabled":false,"gated":false,"lastModified":"2026-06-29T16:42:07.000Z","likes":3,"trendingScore":3,"private":false,"sha":"bcc6f48f8bf7f0ec4c6dfae83982fd030cc835ed","description":"\n\t\n\t\t\n\t\n\t\n\t\tMIRACLE\n\t\n\nMIRACLE is a multimodal benchmark dataset with image-based questions and model evaluation results.\nThis repository contains the test subset of the MIRACLE benchmark. 
The benchmark config provides the benchmark test split, and the model_results config provides test-set evaluation outputs for the included models.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Structure\n\t\n\nThe repository is organized as follows:\ndata/\n  test.parquet          # Hugging Face loadable benchmark test split\n  test.jsonl… See the full description on the dataset page: https://huggingface.co/datasets/queyuecanyang/MIRACLE.","downloads":2,"tags":["task_categories:visual-question-answering","task_categories:image-to-text","language:en","license:cc-by-4.0","size_categories:1K<n<10K","modality:image","modality:tabular","modality:text","region:us","multimodal","benchmark","vision-language"],"createdAt":"2026-03-25T14:07:35.000Z","key":""},{"_id":"69c9594e8e00378897ddbdc9","id":"Ujjwal-Tyagi/ai-ml-foundations-book-collection","author":"Ujjwal-Tyagi","disabled":false,"gated":false,"lastModified":"2026-04-24T15:29:09.000Z","likes":59,"trendingScore":3,"private":false,"sha":"69a3454da6e1cded1cfbb2e1c73df7d7e96d2d55","description":"\n\t\n\t\t\n\t\tIntroduction\n\t\n\nI put this collection together after spending a lot of time reading what I think are some of the best books on AI, machine learning, deep learning, probabilistic modeling, optimization, reinforcement learning, transformers, LLMs, validation, and fairness. 
I want to share this with the community for one simple reason: I want to give people a structured path through the books that actually help them understand things deeply, instead of sending them through random courses… See the full description on the dataset page: https://huggingface.co/datasets/Ujjwal-Tyagi/ai-ml-foundations-book-collection.","downloads":3140,"tags":["task_categories:text-generation","task_categories:text-classification","task_categories:question-answering","task_categories:summarization","task_categories:sentence-similarity","task_categories:feature-extraction","task_categories:zero-shot-classification","task_categories:text-retrieval","task_categories:token-classification","task_categories:multiple-choice","task_categories:fill-mask","language:en","license:apache-2.0","size_categories:n<1K","modality:document","library:datasets","library:mlcroissant","region:us","agent","ai","artificial-intelligence","machine-learning","ml","deep-learning","dl","neural-networks","representation-learning","supervised-learning","unsupervised-learning","semi-supervised-learning","self-supervised-learning","probabilistic-ml","bayesian-learning","statistical-learning","ml-theory","learning-theory","generalization","optimization","convex-optimization","gradient-descent","stochastic-gradient-descent","information-theory","entropy","kl-divergence","causal-inference","causality","decision-making","reinforcement-learning","rl","multi-agent","bandits","markov-decision-process","transformers","attention","large-language-models","llm","foundation-models","generative-ai","generative-models","diffusion-models","vae","gan","autoregressive-models","language-modeling","nlp","natural-language-processing","computer-vision","multimodal","embeddings","feature-extraction","transfer-learning","fine-tuning","prompt-engineering","rag","retrieval-augmented-generation","ai-agents","ai-engineering","ml-engineering","model-training","model-evaluation","validation","robustness","safety","trust
worthy-ai","explainability","interpretability","fairness","bias","responsible-ai","datasets","dataset","benchmark","research","education","textbooks","books","learning-resources","study-guide","curriculum","knowledge-base","open-science","pytorch","tensorflow","huggingface","transformers-library"],"createdAt":"2026-03-29T16:54:38.000Z","key":""},{"_id":"69ce4ed851e88bbbcbc8e7fd","id":"prestoai/arabic-ecom-search-bench","author":"prestoai","disabled":false,"gated":false,"lastModified":"2026-06-28T13:52:48.000Z","likes":4,"trendingScore":3,"private":false,"sha":"e700ac9c100cfa0f5a1f7ffd5cdcbf9d49afef83","description":"\n\t\n\t\t\n\t\n\t\n\t\tArabicEcomSearchBench\n\t\n\n\n  \n\n\nBenchmark for end-to-end Arabic e-commerce retrieval systems, covering lexical, dense, hybrid, and multi-stage retrieval pipelines.\n\n\t\n\t\t\n\t\n\t\n\t\tWhy This Benchmark?\n\t\n\nExisting Arabic NLP benchmarks and MTEB focus heavily on embedding-level evaluation tasks — semantic similarity, classification, or general-purpose retrieval. 
These benchmarks:\n\nEvaluate components in isolation (embeddings, rerankers) rather than the full search pipeline a customer… See the full description on the dataset page: https://huggingface.co/datasets/prestoai/arabic-ecom-search-bench.","downloads":186,"tags":["task_categories:text-retrieval","language:ar","size_categories:10K<n<100K","format:parquet","format:optimized-parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","e-commerce","arabic","search","retrieval","benchmark","libyan-dialect","msa","catalog-search","ndcg"],"createdAt":"2026-04-02T11:11:20.000Z","key":""},{"_id":"69d7079054a04b1f8d367f16","id":"llamaindex/ParseBench","author":"llamaindex","disabled":false,"gated":false,"lastModified":"2026-04-19T01:48:09.000Z","likes":100,"trendingScore":3,"private":false,"sha":"2805a1d940f95a203e0ae4b88be9934f7765b3fc","description":"\n\t\n\t\t\n\t\tParseBench\n\t\n\n\nQuick links: [🌐 Website] [📜 Paper] [💻 Code]\nParseBench is a benchmark for evaluating document parsing systems on real-world enterprise documents, with the following characteristics:\n\nMulti-dimensional evaluation. 
The benchmark is stratified into five capability dimensions — tables, charts, content faithfulness, semantic formatting, and visual grounding — each with task-specific metrics designed to capture what agentic workflows depend on.\nReal-world enterprise… See the full description on the dataset page: https://huggingface.co/datasets/llamaindex/ParseBench.","downloads":16548,"tags":["benchmark:official","benchmark:eval-yaml","language:en","license:apache-2.0","size_categories:100K<n<1M","format:json","modality:document","modality:image","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2604.08538","region:us","document-parsing","pdf","benchmark","evaluation","tables","charts","ocr","layout-detection"],"createdAt":"2026-04-09T01:57:36.000Z","key":""},{"_id":"69e2d226bf20d3a18fad97af","id":"lordx64/reasoning-distill-opus-4-7-max-sft","author":"lordx64","disabled":false,"gated":false,"lastModified":"2026-04-20T22:38:18.000Z","likes":38,"trendingScore":3,"private":false,"sha":"1cbdcd72a8a6681b3713c1d31f01c711b816d1a4","description":"\n\t\n\t\t\n\t\tReasoning traces from Claude Opus 4.7 — SFT-ready\n\t\n\n7,823 single-turn reasoning conversations from Claude Opus 4.7 reformatted for supervised fine-tuning with trl.SFTTrainer + train_on_responses_only. Each row is a single text field containing a full Qwen-style chat-template conversation.\n\n\t\n\t\t\n\t\tProvenance\n\t\n\nEvery conversation's assistant response (including the <think>...</think> block) is output from claude-opus-4-7 with Anthropic's extended-thinking enabled. 
This is the… See the full description on the dataset page: https://huggingface.co/datasets/lordx64/reasoning-distill-opus-4-7-max-sft.","downloads":468,"tags":["task_categories:text-generation","language:en","license:apache-2.0","size_categories:1K<n<10K","format:parquet","format:optimized-parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","reasoning","chain-of-thought","distillation","claude","opus-4-7","sft","qwen-chat-template"],"createdAt":"2026-04-18T00:36:54.000Z","key":""},{"_id":"69e4aa7ea8ad7ec14c63ae71","id":"Roman1111111/claude-sonnet-4.6-120000x","author":"Roman1111111","disabled":false,"gated":false,"lastModified":"2026-04-19T10:59:32.000Z","likes":75,"trendingScore":3,"private":false,"sha":"ab722bb8ea6e47386dc4c8227246640414037fe5","description":"license: mit\ntask_categories:\ntext-generation\ntext2text-generation\nlanguage:\nen\ntags:\nreasoning\nuncensored\nmath\ncode\nclaude-sonnet-4.6\nclaude-opus-4.6\ngemini-3.1-pro\nsize_categories:\n100K<n<1M\nPlease support if possible\n\n\n\n\n\nclaude-sonnet-4.6-natural-large\n\n\n\n\n\n\n\n\n\nSonnet4.6 NATURAL REASONING\nMulti-Domain(covered all possible topics in chats)/ Uncensored generated by claude sonnet 4.6(my biggest and most expensive project, i spent all my birthday money gifts for you guys❤️😁😭😭😭)\n\n\n\n\n01… See the full description on the dataset page: 
https://huggingface.co/datasets/Roman1111111/claude-sonnet-4.6-120000x.","downloads":709,"tags":["size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-04-19T10:12:14.000Z","key":""},{"_id":"69e803c8915b53b817584712","id":"Glint-Research/Opus-4.6-Reasoning-2160x","author":"Glint-Research","disabled":false,"gated":false,"lastModified":"2026-04-21T23:10:21.000Z","likes":13,"trendingScore":3,"private":false,"sha":"93c2ead46e6de1d340f087bb8185f70b726aae0c","description":"\n\t\n\t\t\n\t\n\t\n\t\tOpus-4.6-Reasoning-2160x\n\t\n\n2,160 high-quality reasoning traces generated by Claude Opus 4.6 via OpenRouter, covering mathematics, competitive programming, logic, science, and language tasks. Each example includes the full problem, an extended chain-of-thought, and a final solution — making the dataset suitable for supervised fine-tuning, chain-of-thought distillation, and reasoning-capability transfer to smaller models.\nOriginally generated as a batch of 3,305 examples; 1,145 were… See the full description on the dataset page: https://huggingface.co/datasets/Glint-Research/Opus-4.6-Reasoning-2160x.","downloads":337,"tags":["task_categories:text-generation","task_categories:question-answering","language:en","license:apache-2.0","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","reasoning","chain-of-thought","distillation","claude","opus","math","code","sft"],"createdAt":"2026-04-21T23:10:00.000Z","key":""},{"_id":"69f0c6101cc98d8ac04c03cd","id":"jasperai/monet","author":"jasperai","disabled":false,"gated":false,"lastModified":"2026-06-24T12:44:19.000Z","likes":138,"trendingScore":3,"private":false,"sha":"233245a0ccd478848159d1c86d761c1522a57699","description":"\n\t\n\t\t\n\t\n\t\n\t\tDataset Card for MONET\n\t\n\nMONET (Massive, Open, Non-redundant and Enriched Text-to-image 
dataset) is a large-scale, curated image-text dataset designed for training text-to-image (T2I) systems. It contains 104.9 million high-quality image-text pairs distilled from 2.9 billion raw pairs across nine heterogeneous open sources (6 real and 3 synthetic) through successive stages of safety filtering, domain-based filtering, exact and near-duplicate removal, and re-captioning with… See the full description on the dataset page: https://huggingface.co/datasets/jasperai/monet.","downloads":180733,"tags":["task_categories:text-to-image","task_categories:image-feature-extraction","task_categories:zero-shot-image-classification","language:en","license:apache-2.0","size_categories:100M<n<1B","modality:image","arxiv:2605.21272","region:us","multimodal","image-text","captioning","text-to-image","synthetic-data"],"createdAt":"2026-04-28T14:37:04.000Z","key":""},{"_id":"69f338d97571697c5ad6b3d8","id":"ai4bharat/SpeechArenaBench","author":"ai4bharat","disabled":false,"gated":"auto","lastModified":"2026-06-19T09:14:38.000Z","likes":5,"trendingScore":3,"private":false,"sha":"fdd26e85ff051036e3eb49d7053d2b8f15c14ec0","description":"\n\t\n\t\t\n\t\n\t\n\t\tSpeechArenaBench\n\t\n\nSpeechArenaBench is a large-scale human-preference dataset for evaluating multilingual Text-to-Speech (TTS) systems across 10 Indian languages. 
It accompanies the paper \"Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages\" (accepted to Interspeech 2026), and contains the full benchmark sentences, generated audio, and crowd-sourced pairwise preference judgments collected from native raters.\nThe… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/SpeechArenaBench.","downloads":147,"tags":["language:bn","language:gu","language:hi","language:kn","language:ml","language:mr","language:or","language:ta","language:te","language:ur","license:mit","size_categories:100K<n<1M","format:parquet","format:optimized-parquet","modality:audio","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2604.21481","doi:10.57967/hf/9219","region:us","text-to-speech","tts","speech-synthesis","evaluation","human-preference","bradley-terry","indic-languages","multilingual","code-mixing","audio"],"createdAt":"2026-04-30T11:11:21.000Z","key":""},{"_id":"69fc1f1a2042bc11f9fc0092","id":"agents-last-exam/agents-last-exam","author":"agents-last-exam","disabled":false,"gated":false,"lastModified":"2026-06-12T18:28:44.000Z","likes":194,"trendingScore":3,"private":false,"sha":"b07f71f2b82477f02c8c4e1b885fa032e16aed86","description":"\n\t\n\t\t\n\t\n\t\n\t\tAgents Last Exam — Task Card Metadata (v1.0)\n\t\n\nA metadata-only release (v1.0) of 153 tasks from the Agents Last Exam (ALE)\nbenchmark for evaluating computer-use agents on long-horizon professional work.\n\n\t\n\t\t\n\t\n\t\n\t\tThe Agents Last Exam dataset family\n\t\n\nALE is published as three companion HuggingFace datasets:\n\n\t\n\t\t\nDataset\nContents\nAccess\n\n\n\t\t\nTask Card Metadata\nOne row per task: titles, prompts, taxonomy, input-file descriptors\nOpen\n\n\nTask Input Data\nThe input/ files each task… See the full description on the dataset page: 
https://huggingface.co/datasets/agents-last-exam/agents-last-exam.","downloads":8203,"tags":["language:en","license:cc-by-4.0","size_categories:n<1K","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","computer-use-agents","agent-benchmark","benchmark","evaluation"],"createdAt":"2026-05-07T05:11:54.000Z","key":""},{"_id":"6a05dde0225b64c39377f55d","id":"camvsl/Articraft-10K","author":"camvsl","disabled":false,"gated":false,"lastModified":"2026-05-15T05:05:02.000Z","likes":24,"trendingScore":3,"private":false,"sha":"3c79d5a05bb7cb6bf7bfee5e090176636ee3ac65","description":"This repository contains the 10k articulated 3D objects (in URDF format) from Articraft-10K.\nArticraft-10K is a large-scale articulated 3D dataset generated by the Articraft agent.\n","downloads":3680,"tags":["license:cc-by-4.0","region:us"],"createdAt":"2026-05-14T14:36:16.000Z","key":""},{"_id":"6a074fe9fac012db361bd2fb","id":"facebook/wearable-ai","author":"facebook","disabled":false,"gated":"auto","lastModified":"2026-06-19T00:15:39.000Z","likes":16,"trendingScore":3,"private":false,"sha":"486ad487cf8112bb76e8ab728747d872e626e632","description":"\n\t\n\t\t\n\t\n\t\n\t\tWearable AI Dataset (ECCV 2026)\n\t\n\nPart of the Wearable AI Workshop at ECCV 2026.\nA benchmark of egocentric (first-person, head-mounted wearable camera) videos paired with three complementary video question-answering tasks for evaluating wearable-AI assistants on real-world everyday activity videos.\n\n▶ Baseline code & evaluation scripts: see starter_kit/README.md. 
The starter kit ships inside this repo, so git clone gives you the code and the data together.\n\n\n\t\n\t\t\n\t\n\t\n\t\tTasks… See the full description on the dataset page: https://huggingface.co/datasets/facebook/wearable-ai.","downloads":2430,"tags":["task_categories:visual-question-answering","task_categories:video-text-to-text","language:en","license:cc-by-nc-4.0","size_categories:1K<n<10K","format:json","modality:text","modality:video","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","egocentric","wearable","first-person","video","conversation","long-video-qa","eccv2026"],"createdAt":"2026-05-15T16:55:05.000Z","key":""},{"_id":"6a0b002d95ec508059e1f5a2","id":"auren-research/cve-sft-v5","author":"auren-research","disabled":false,"gated":false,"lastModified":"2026-05-18T12:42:53.000Z","likes":7,"trendingScore":3,"private":false,"sha":"5a63142bf5c17b8336a79622181178bc5cb69808","description":"\n\t\n\t\t\n\t\tCVE SFT Dataset v5\n\t\n\n\n  \n  \n  \n  \n  \n\n\nCVE SFT Dataset v5 is a structured instruction-following dataset for fine-tuning language models on cybersecurity vulnerability analysis. Built by Auren Research, it combines authoritative vulnerability metadata from the NIST National Vulnerability Database (NVD) with five generated fields that teach models to explain, reason about, and remediate real-world CVEs — including side-by-side vulnerable vs. 
safe code examples.Unlike most security… See the full description on the dataset page: https://huggingface.co/datasets/auren-research/cve-sft-v5.","downloads":97,"tags":["task_categories:text-generation","task_categories:question-answering","language:en","license:cc-by-4.0","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","cybersecurity","vulnerability","cve","sft","fine-tuning","code-security","nvd","cvss","cwe","security-research","penetration-testing"],"createdAt":"2026-05-18T12:03:57.000Z","key":""},{"_id":"6a0e8046d4a533e55528fa04","id":"Qwen/Qwen-Image-Bench","author":"Qwen","disabled":false,"gated":false,"lastModified":"2026-05-28T08:09:42.000Z","likes":39,"trendingScore":3,"private":false,"sha":"d2493deb153b020cf169c7e3f57d15e4dd697038","description":"\n\t\n\t\t\n\t\tQwen-Image-Bench\n\t\n\n\n  \n  \n  \n  \n  A creator-centric benchmark for evaluating Text-to-Image models beyond semantic alignment.\n\n\t\n\t\t\n\t\tLinks\n\t\n\n\n\t\n\t\t\nResource\nLink\n\n\n\t\t\n📑 Paper\nhttp://arxiv.org/abs/2605.28091\n\n\n📊 Benchmark Dataset (HuggingFace)\nhttps://huggingface.co/datasets/Qwen/Qwen-Image-Bench\n\n\n📊 Benchmark Dataset (ModelScope)\nhttps://www.modelscope.cn/datasets/Qwen/Qwen-Image-Bench\n\n\n💻 GitHub\nhttps://github.com/QwenLM/Qwen-Image-Bench\n\n\n🧑‍⚖️ Q-Judger Model… See the full description on the dataset page: 
https://huggingface.co/datasets/Qwen/Qwen-Image-Bench.","downloads":13100,"tags":["task_categories:image-to-text","language:en","language:zh","license:apache-2.0","size_categories:1K<n<10K","format:json","modality:image","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2605.28091","region:us","text-to-image","image-generation","benchmark","evaluation"],"createdAt":"2026-05-21T03:47:18.000Z","key":""},{"_id":"6a175676e914809f6b8eb009","id":"Longitude-Labs/bluefin-release","author":"Longitude-Labs","disabled":false,"gated":false,"lastModified":"2026-06-26T18:01:43.000Z","likes":3,"trendingScore":3,"private":false,"sha":"6a2a02db0a764b8da52214a573954143ecf68cea","description":"\n\t\n\t\t\n\t\n\t\n\t\tBluefin Release\n\t\n\nA spreadsheet-agent benchmark of financial-modeling tasks. Each task provides an\nExcel input workbook (and, where applicable, a reference output workbook plus a\ngrading rubric). This dataset is published in Arrow-compatible Parquet so it can\nbe loaded directly with 🤗 datasets — no custom loading script required.\nThe original per-task files remain in the tasks/ tree; the Parquet subsets\nunder data/ are a row-assembled, Arrow-native view of those same files.… See the full description on the dataset page: 
https://huggingface.co/datasets/Longitude-Labs/bluefin-release.","downloads":130,"tags":["language:en","license:cc-by-nc-4.0","size_categories:n<1K","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2605.30907","region:us","spreadsheet","agent-benchmark","finance"],"createdAt":"2026-05-27T20:39:18.000Z","key":""},{"_id":"6a1e00f439fe6ce4eb36640f","id":"nvidia/Nemotron-Math-Proofs-v2","author":"nvidia","disabled":false,"gated":false,"lastModified":"2026-06-16T04:25:35.000Z","likes":14,"trendingScore":3,"private":false,"sha":"7665d7f1d006fd89aa852a9dab8060c60b63f814","description":"\n\t\n\t\t\n\t\n\t\n\t\tNemotron-Math-Proofs-v2\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Description:\n\t\n\nNemotron-Math-Proofs-v2 is a mathematical proof-generation, verification, and meta-verification trace dataset. The problems are sourced from nvidia/Nemotron-Math-Proofs-v1 only taking the AoPS subset. The release contains 82,737 samples across 5,752 unique problems.\nFor this version, solutions are generated using DeepSeek-V4-Pro on Max inference mode. 
The generation pipeline produces proofs, verification traces, and… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-Math-Proofs-v2.","downloads":2836,"tags":["task_categories:text-generation","language:en","license:cc-by-4.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2511.22570","arxiv:2512.15489","region:us","math","proofs","mathematical-reasoning","text","human","synthetic","automated","post-training","Nemotron_3_Ultra"],"createdAt":"2026-06-01T22:00:20.000Z","key":""},{"_id":"6a1f5093da5f8c053f9daa7a","id":"ai4privacy/pii-masking-openpii-1.5m","author":"ai4privacy","disabled":false,"gated":false,"lastModified":"2026-06-03T06:53:47.000Z","likes":12,"trendingScore":3,"private":false,"sha":"a785eb528e28be2693c3718a27e066970de5dadb","description":"\n\t\n\t\t\n\t\n\t\n\t\tOpenPII 1.5M: Multilingual PII Masking Dataset (Asia Pacific Extension)\n\t\n\n\n  \n\n\n📖 More information: www.ai4privacy.com/datasets/pii-masking-3m-asia-pacific\n\n\t\n\t\t\n\t\n\t\n\t\tOverview\n\t\n\nThe OpenPII 1.5M dataset extends OpenPII 1M\nwith a new Asia Pacific corpus, bringing global coverage to 30 languages\nacross Europe, Americas, and Asia Pacific.\nThis is the flagship release of the PII-Masking-3M family, the world's\nlargest open multilingual PII masking corpus. 
Built to advance open… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-openpii-1.5m.","downloads":1558,"tags":["task_categories:token-classification","task_categories:text-generation","source_datasets:extended","language:en","language:fr","language:de","language:es","language:it","language:nl","language:pt","language:bg","language:cs","language:da","language:el","language:et","language:fi","language:hr","language:hu","language:lt","language:lv","language:pl","language:ro","language:sk","language:sl","language:sr","language:sv","language:id","language:ja","language:ko","language:ms","language:tl","language:vi","language:zh","license:other","size_categories:1M<n<10M","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","privacy","pii","sensitive-data","data-masking","data-anonymization","ner","synthetic","multilingual","ai4privacy","openpii","asia-pacific"],"createdAt":"2026-06-02T21:52:19.000Z","key":""},{"_id":"6a235fc46b7ad7825f11362c","id":"11-47/claude_opus_4.8_max_thinking_5k_v2","author":"11-47","disabled":false,"gated":false,"lastModified":"2026-06-06T00:09:54.000Z","likes":4,"trendingScore":3,"private":false,"sha":"f4351240b77cb1260613e44826781eb40f95ab02","description":"\n\t\n\t\t\n\t\n\t\n\t\tClaude Opus 4.8 MAX THINKING — Distillation Dataset\n\t\n\n5,000 high-quality examples designed to distill the maximum-effort reasoning, honest analysis, production software engineering, and agentic capabilities of Claude Opus 4.8.\n\n\n\t\n\t\t\n\t\n\t\n\t\tOverview\n\t\n\nThis dataset captures Opus 4.8’s signature strengths:\n\nDeep, structured, high-effort reasoning\nHonest communication about trade-offs and uncertainties\nExcellent production software engineering judgment\nStrong agentic workflow design… See the full description on the dataset page: 
https://huggingface.co/datasets/11-47/claude_opus_4.8_max_thinking_5k_v2.","downloads":534,"tags":["size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-06-05T23:46:12.000Z","key":""},{"_id":"6a26b9b27f9ff7ac4d8ba773","id":"sonatbaltaci/textindiagrams","author":"sonatbaltaci","disabled":false,"gated":false,"lastModified":"2026-06-23T09:42:01.000Z","likes":4,"trendingScore":3,"private":false,"sha":"0b0bdf17a7589accaaf5bb8bc97e84f6945fb860","description":"\n\t\n\t\t\n\t\n\t\n\t\tText Region Detection in Historical Astronomical Diagrams\n\t\n\nOfficial repository of the paper \"Text region detection in historical astronomical diagrams\". We introduce the first large, diverse, open-access dataset of 948 historical astronomical diagrams annotated with 10,940 oriented polygonal text regions that spans ten centuries (8th to 18th) and seven major traditions: Arabic, Persian, Chinese, Byzantine, Latin, Hebrew, and Sanskrit.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset 📜\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tContent… See the full description on the dataset page: https://huggingface.co/datasets/sonatbaltaci/textindiagrams.","downloads":124,"tags":["task_categories:object-detection","license:cc","size_categories:n<1K","format:parquet","format:optimized-parquet","modality:image","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-06-08T12:46:42.000Z","key":""},{"_id":"6a29e6f459c528b2c94210f9","id":"mbaye930/wolof-arabic-parallel-corpus","author":"mbaye930","disabled":false,"gated":false,"lastModified":"2026-06-11T16:52:35.000Z","likes":3,"trendingScore":3,"private":false,"sha":"99dc14279fff4fed70a8751bed741832377c546d","description":"\n\t\n\t\t\n\t\n\t\n\t\tMudawanSn: A Gold-Standard Wolof--Arabic Parallel Corpus for Machine Translation\n\t\n\nA publicly available parallel corpus for the Wolof–Arabic language pair, a 
gold-standard resource containing 1,271 sentence-aligned pairs. The corpus consists of manual translations from Wolof into Modern Standard Arabic (MSA). The source texts are drawn from the MasakhaNER corpus, covering politics, society, religion, and sports in Senegalese news discourse.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Structure\n\t\n\nThe… See the full description on the dataset page: https://huggingface.co/datasets/mbaye930/wolof-arabic-parallel-corpus.","downloads":40,"tags":["task_categories:translation","language:wo","language:ar","license:cc-by-nc-sa-4.0","size_categories:1K<n<10K","format:csv","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-06-10T22:36:36.000Z","key":""},{"_id":"6a2a0f4091887681beb657d9","id":"sbintuitions/joyo-kanji-yomi-benchmark","author":"sbintuitions","disabled":false,"gated":false,"lastModified":"2026-06-29T02:37:04.000Z","likes":3,"trendingScore":3,"private":false,"sha":"2ae383bcdac88dacf89c9ab84f9990eec55fe5c9","description":"\n\t\n\t\t\n\t\n\t\n\t\tJoyo Kanji Yomi Benchmark\n\t\n\n\n\nA kanji-level pronunciation evaluation benchmark for Japanese TTS, covering all 2,136 Joyo kanji and their 4,378 readings with 13,095 native-speaker-verified test sentences.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Description\n\t\n\nEach sample targets a specific kanji-reading pair. The sentence context is designed so that only the target reading is valid. 
All sentences and annotations have been verified by 35 native Japanese speakers through a three-stage review process.… See the full description on the dataset page: https://huggingface.co/datasets/sbintuitions/joyo-kanji-yomi-benchmark.","downloads":2,"tags":["task_categories:text-to-speech","language:ja","license:mit","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2606.25369","region:us","tts-evaluation","japanese","kanji","chinese-character","pronunciation"],"createdAt":"2026-06-11T01:28:32.000Z","key":""},{"_id":"6a2abdf2c2661173235127d0","id":"Gingiris/gingiris-twitter-agent-ops","author":"Gingiris","disabled":false,"gated":false,"lastModified":"2026-06-19T02:49:57.000Z","likes":6,"trendingScore":3,"private":false,"sha":"e4cd127900577935795a9c3e594c3b09150a749c","description":"\n\t\n\t\t\n\t\n\t\n\t\tTwitter/X Agent Operations — AI 自动运营完整 SOP\n\t\n\n\n🌍 Language / 语言: 中文 | English | 日本語 | 한국어\n\n\n实战验证：一个 AI agent 在 45 天内将 @WeiYipei 从 1,150 → 1,837 粉丝（+60%），日均发布 1 条，全程自动化运营。\n本 skill 适用于任何支持 system prompt 的 AI agent（Claude Code, Cursor, Trae, GPT）。\n\n\n\n\t\n\t\t\n\t\n\t\n\t\t一、系统架构概览\n\t\n\n┌─────────────────────────────────────────────┐\n│          Twitter Agent Operations           │\n├─────────────────────────────────────────────┤\n│                                             │\n│  [1] 人设校准 ──→ [2] 素材库… See the full description on the dataset page: 
https://huggingface.co/datasets/Gingiris/gingiris-twitter-agent-ops.","downloads":117,"tags":["task_categories:text-generation","language:en","language:zh","language:ja","language:ko","license:mit","size_categories:n<1K","region:us","twitter","x-twitter","social-media","ai-agent","automation","content-marketing","growth","audience-building","scheduling","analytics","persona","sop","content-strategy"],"createdAt":"2026-06-11T13:53:54.000Z","key":""},{"_id":"6a2d645596c462b483de27bc","id":"Pythagoras-LM/SFT_Dataset","author":"Pythagoras-LM","disabled":false,"gated":false,"lastModified":"2026-06-22T15:42:51.000Z","likes":9,"trendingScore":3,"private":false,"sha":"6865661fe316db5f5fb0617ccbab6fd1e4b17a33","description":"\n\t\n\t\t\n\t\n\t\n\t\tPythagoras SFT Dataset\n\t\n\nProject Page | GitHub | Paper\n\n\t\n\t\t\n\t\n\t\n\t\tData\n\t\n\nOur training dataset consists of approximately 841K problems paired with Lean formal statements, formal proofs, and reasoning chains. We release a partial subset, which consists of 126K instances:\n\n30K easy instances\n49K medium instances\n47K hard instances\n\nComplete data will be released soon.\nThe complete explanation of the synthetic data generation pipeline can be found in Pythagoras-Prover: Advancing… See the full description on the dataset page: 
https://huggingface.co/datasets/Pythagoras-LM/SFT_Dataset.","downloads":227,"tags":["task_categories:text-generation","language:en","license:apache-2.0","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2606.12594","region:us","Lean4","Theorem-Proving","Formal-Reasoning"],"createdAt":"2026-06-13T14:08:21.000Z","key":""},{"_id":"6a2e2afbf32ef11c3cffab35","id":"agents-last-exam/agents-last-exam-data-archive","author":"agents-last-exam","disabled":false,"gated":"manual","lastModified":"2026-06-21T23:45:10.000Z","likes":7,"trendingScore":3,"private":false,"sha":"5ae9b719a901c14a9ccec7b3bd156d663e3eedcb","description":"\n\t\n\t\t\n\t\n\t\n\t\tAgents Last Exam — Task Data Archive (input + reference)\n\t\n\n⚠️ Gated dataset. This repo packages each task's input, software, and\nreference (ground-truth) data into a single archive (ale-tasks-data.tar.gz)\nfor convenient one-shot download — in particular for running ALE locally with\nthe local Docker provider,\nwhich fetches it and mounts each task's data at run time. 
Because it includes\nthe reference outputs used to score runs, access requires login, agreement to\nthe terms on the… See the full description on the dataset page: https://huggingface.co/datasets/agents-last-exam/agents-last-exam-data-archive.","downloads":149,"tags":["language:en","license:cc-by-4.0","region:us","computer-use-agents","agent-benchmark","benchmark","evaluation"],"createdAt":"2026-06-14T04:15:55.000Z","key":""},{"_id":"6a3119ee430e0b2b1bb718dc","id":"notune/fable5-repos","author":"notune","disabled":false,"gated":false,"lastModified":"2026-06-16T13:22:36.000Z","likes":6,"trendingScore":3,"private":false,"sha":"008e326fb42a1275069fb972e0338e77053358df","description":"\n\t\n\t\t\n\t\n\t\n\t\tFable 5 — All-Commits GitHub Repositories\n\t\n\nA collection of 7,090 public GitHub repositories whose entire default-branch\nhistory was written by Claude Fable 5 — every non-merge commit carries the\ntrailer:\nCo-Authored-By: Claude Fable 5 <noreply@anthropic.com>\n\nEach repository is stored as a full .tar.gz archive including its complete\n.git history, so you get every commit, message, and diff exactly as it\nappears on GitHub. 
A manifest.jsonl / manifest.csv table describes every\nrepo… See the full description on the dataset page: https://huggingface.co/datasets/notune/fable5-repos.","downloads":1972,"tags":["task_categories:text-generation","language:code","license:other","size_categories:1K<n<10K","format:json","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","code","github","claude","claude-fable-5","ai-generated","source-code"],"createdAt":"2026-06-16T09:39:58.000Z","key":""},{"_id":"6a316714ddf9fa9fa55423b3","id":"TheFusionCube/Fable-5-CoT-Traces","author":"TheFusionCube","disabled":false,"gated":false,"lastModified":"2026-06-16T15:25:06.000Z","likes":7,"trendingScore":3,"private":false,"sha":"3624e3a8aab36a89dc3b89aebef0e034a64ea04e","description":"Personal collection of Fable 5 reasoning traces. \nFilter out the decoy ones and you're good. \nHave fun! (Also, star my repo https://github.com/FusionCube18712/claude-codex-auto-resume if you can)\nHappy distilling.\n","downloads":273,"tags":["language:en","license:mit","size_categories:n<1K","format:json","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","claude","fable","distill"],"createdAt":"2026-06-16T15:09:08.000Z","key":""},{"_id":"6a318425abfe225c52dab20d","id":"BodyMaps/CancerVerse","author":"BodyMaps","disabled":false,"gated":false,"lastModified":"2026-06-29T20:15:34.000Z","likes":6,"trendingScore":3,"private":false,"sha":"f72399da18356320d0e825d1e5705d65e4f752c3","description":"\n\t\n\t\t\n\t\n\t\n\t\t🩻 CancerVerse\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tThe first longitudinal, multimodal CT dataset spanning 13 malignant tumor types\n\t\n\nCancerVerse pairs whole-body abdominal/pelvic CT volumes with the radiologists' own free-text reports, follows patients across time, and is being released in stages — culminating in expert voxel-level tumor masks and a deep clinical & longitudinal 
annotation layer. It is built to power the next generation of cancer-aware medical AI: tumor detection and segmentation… See the full description on the dataset page: https://huggingface.co/datasets/BodyMaps/CancerVerse.","downloads":24029,"tags":["task_categories:image-classification","task_categories:image-segmentation","task_categories:image-to-text","license:cc-by-nc-nd-4.0","size_categories:10K<n<100K","library:datasets","library:mlcroissant","region:us","medical-imaging","computed-tomography","oncology","cancer","radiology-reports","longitudinal","multimodal","vision-language"],"createdAt":"2026-06-16T17:13:09.000Z","key":""},{"_id":"6a33f8732bf5d1c979fed69d","id":"CodeDevX/Vibe-Coding-Instruct-V2","author":"CodeDevX","disabled":false,"gated":false,"lastModified":"2026-06-18T19:12:07.000Z","likes":9,"trendingScore":3,"private":false,"sha":"54a83576f89f376ce3733ca3d3e6bf03c1e8d394","description":"\n\t\n\t\t\n\t\n\t\n\t\tDataset Card for Dataset Name\n\t\n\n\n\nThis dataset card aims to be a base template for new datasets. 
It has been generated using this raw template.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Details\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Description\n\t\n\n\n\n\n\n\nCurated by: [More Information Needed]\nFunded by [optional]: [More Information Needed]\nShared by [optional]: [More Information Needed]\nLanguage(s) (NLP): [More Information Needed]\nLicense: [More Information Needed]\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Sources [optional]… See the full description on the dataset page: https://huggingface.co/datasets/CodeDevX/Vibe-Coding-Instruct-V2.","downloads":147,"tags":["task_categories:text-classification","language:en","license:apache-2.0","size_categories:1M<n<10M","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","ai","custom","vibecodinginstruct","vibecodinginstructv2"],"createdAt":"2026-06-18T13:53:55.000Z","key":""},{"_id":"6a345ea12ca4658bb5dde980","id":"ArtificialAnalysis/AA-Briefcase-Lite","author":"ArtificialAnalysis","disabled":false,"gated":false,"lastModified":"2026-06-19T02:03:22.000Z","likes":7,"trendingScore":3,"private":false,"sha":"4dec557b47d43867a1648c0974db1d8208c8b677","description":"\n\t\n\t\t\n\t\n\t\n\t\tAA-Briefcase-Lite\n\t\n\nThe public example scenario for AA-Briefcase, Artificial Analysis' frontier agentic evaluation of realistic, long-horizon knowledge work.\n\nLeaderboard and detailed results\nLaunch article\n\nAA-Briefcase extends frontier model benchmarking beyond coding and short-form reasoning to the professional deliverables knowledge workers produce day to day. 
It consists of four private scenarios in which agents complete realistic professional workflows across data science… See the full description on the dataset page: https://huggingface.co/datasets/ArtificialAnalysis/AA-Briefcase-Lite.","downloads":661,"tags":["task_categories:other","language:en","license:apache-2.0","size_categories:n<1K","format:json","modality:document","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","agents","agentic-evaluation","benchmark","private-equity","due-diligence"],"createdAt":"2026-06-18T21:09:53.000Z","key":""},{"_id":"6a356a6aaa2b91c4e1f29f9f","id":"empero-ai/MiniMax-M3-150k-Mixed","author":"empero-ai","disabled":false,"gated":false,"lastModified":"2026-06-19T16:13:06.000Z","likes":3,"trendingScore":3,"private":false,"sha":"26d7578ad9b3c3d13e299b4f572ed6e137efe717","description":"\n\t\n\t\t\n\t\n\t\n\t\tm3-alldomains-verified-107k\n\t\n\nVerified distillation traces generated with faststill v0.0.1 — a pipeline that generates (prompt, reasoning, output) triplets from any OpenAI-compatible chat-completions endpoint and deterministically verifies every row before keeping it. 
A row is verified=true only when a machine check (executed unit tests, exact / normalized answer compare) confirmed it, so wrong labels are filtered out instead of poisoning a student model.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset… See the full description on the dataset page: https://huggingface.co/datasets/empero-ai/MiniMax-M3-150k-Mixed.","downloads":89,"tags":["task_categories:text-generation","language:en","license:apache-2.0","size_categories:100K<n<1M","format:json","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","faststill","synthetic","verified","distillation","code","reasoning","math"],"createdAt":"2026-06-19T16:12:26.000Z","key":""},{"_id":"6a369ac72d8e9d25da189c03","id":"DavidrPatton/Fable-5-GLM-5.2-Traces","author":"DavidrPatton","disabled":false,"gated":false,"lastModified":"2026-06-21T23:40:19.000Z","likes":5,"trendingScore":3,"private":false,"sha":"62cf21ec2bc6ce9404af020c4241e52cc5561b8c","description":"\n\t\n\t\t\n\t\n\t\n\t\tFable-5 + GLM-5.2 AGI-Expanded Traces\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tDescription\n\t\n\nA premium fine-tuning dataset for training autonomous AGI agents. 
Contains 10,526 rows across 188 sessions of real coding agent traces, expanded with natural AGI-style chain-of-thought narration.\n\n\t\n\t\t\n\t\n\t\n\t\tStats\n\t\n\n\nTotal rows: 10,526\nTotal sessions: 188\nSize: 66.7 MB\nFormat: Unsloth SFTTrainer messages format\n\n\n\t\n\t\t\n\t\n\t\n\t\tAGI Narration Style\n\t\n\nEach row's assistant COT includes:\n\nTool call narration: \"Let me use… See the full description on the dataset page: https://huggingface.co/datasets/DavidrPatton/Fable-5-GLM-5.2-Traces.","downloads":281,"tags":["task_categories:text-generation","language:en","size_categories:10K<n<100K","region:us","agi","chain-of-thought","fine-tuning","qwen3","autonomous-agent","tool-use","memory","skills"],"createdAt":"2026-06-20T13:51:03.000Z","key":""},{"_id":"6a37a243a519cd301481eb31","id":"choucsan/mimo-claude-code-traces-1k","author":"choucsan","disabled":false,"gated":false,"lastModified":"2026-06-22T05:38:42.000Z","likes":4,"trendingScore":3,"private":false,"sha":"eab26ca8d624e8f7b816231aef9ae21667bc2c5d","description":"\n\t\n\t\t\n\t\n\t\n\t\tMIMO Claude Code Traces\n\t\n\nMIMO Claude Code Traces is a collection of coding-agent trajectories in a Claude Code-style environment. 
Each record contains a user coding task, the full multi-turn message trace, available tool schemas, assistant reasoning fields, tool calls, tool outputs, and metadata such as model name, category, duration, cost, token usage, and whether the trace used tools.\n\n  \n\n\nThe traces were generated with mimo-v2.5-pro, MiMo's most capable model at the time of… See the full description on the dataset page: https://huggingface.co/datasets/choucsan/mimo-claude-code-traces-1k.","downloads":1123,"tags":["task_categories:text-generation","task_categories:question-answering","task_categories:other","language:en","license:mit","size_categories:1K<n<10K","format:json","format:agent-traces","modality:tabular","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us","code-agent","coding","agent-traces","format:json","format:agent-traces","tool-use","claude","software-engineering","code-generation","debugging","refactoring","shell","devops","reasoning","distillation","claude-fable-5","teich"],"createdAt":"2026-06-21T08:35:15.000Z","key":""},{"_id":"6a3824986e621593a39df279","id":"DomofonResearch/Tool-Reasoning-31K","author":"DomofonResearch","disabled":false,"gated":false,"lastModified":"2026-06-21T17:51:26.000Z","likes":4,"trendingScore":3,"private":false,"sha":"f2dae2e91ba89dc8bf3d60d74f22bea4968b106b","description":"\n\t\n\t\t\n\t\n\t\n\t\tTool-Reasoning-31K\n\t\n\n30,764 unique tool-use conversations in Anthropic-style ChatML, each assistant step prefixed\nwith a short first-person <reasoning>. The assistant decides whether and which tool to call,\ncalls it, reads a real tool result, and answers — across single-turn, multi-step, multi-turn, and\n\"no tool fits\" (relevance) scenarios.\nDerived from interstellarninja/hermes_reasoning_tool_use\n(which aggregates xLAM / ToolACE / Glaive / Nous-Hermes / Nvidia-When2Call). 
The… See the full description on the dataset page: https://huggingface.co/datasets/DomofonResearch/Tool-Reasoning-31K.","downloads":98,"tags":["task_categories:text-generation","language:en","license:apache-2.0","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","tool-use","function-calling","tool-calling","agentic","reasoning","chain-of-thought","thinking","chatml","messages","conversational"],"createdAt":"2026-06-21T17:51:20.000Z","key":""},{"_id":"6a38b528b22e71da5a862d50","id":"sequelbox/Titanium4-DeepSeek-V4-Pro","author":"sequelbox","disabled":false,"gated":false,"lastModified":"2026-06-22T04:36:35.000Z","likes":7,"trendingScore":3,"private":false,"sha":"4b6b6c1849c757f6071ffc1e2bf4b13ef54e0b2b","description":"Click here to support our open-source dataset and model releases - help us speed up our release schedule!\nTitanium 4 is an agentic coding dataset focused on DevOps and architecture, testing the limits of DeepSeek-V4-Pro's agentic skills:\n\nQuestions prioritize real-world, challenging agentic coding tasks in DevOps and architecture across a variety of programming languages and topics.\nAreas of focus include IaC, cloud architecture, incident response, configuration and cost optimization, security… See the full description on the dataset page: 
https://huggingface.co/datasets/sequelbox/Titanium4-DeepSeek-V4-Pro.","downloads":175,"tags":["task_categories:text-generation","language:en","license:apache-2.0","size_categories:10K<n<100K","format:csv","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","doi:10.57967/hf/9257","region:us","titanium","titanium-4","agentic","agentic-coding","python","dev-ops","devops","terraform","ansible","docker","jenkins","kubernetes","helm","grafana","prometheus","shell","bash","azure","aws","gcp","c++","c#","c","rust","java","javascript","typescript","algorithms","data-structures","concurrency","api","sql","database","auth","microservices","cloud","testing","tooling","embedded-systems","problem-solving","expert","architect","engineer","developer","instruct","creative","analytical","reasoning","rational","chat","chat-instruct","synthetic","conversational","deepseek","deepseek-v4","deepseek-v4-pro"],"createdAt":"2026-06-22T04:08:08.000Z","key":""},{"_id":"6a3907f29ed50d27aa76cb3a","id":"bigfacing/GOKU-2M","author":"bigfacing","disabled":false,"gated":"manual","lastModified":"2026-06-24T05:54:26.000Z","likes":5,"trendingScore":3,"private":false,"sha":"e89307f0ba4300650d6a140403d605d3ef5b9492","downloads":21,"tags":["region:us"],"createdAt":"2026-06-22T10:01:22.000Z","key":""},{"_id":"6a3bb61f524edb421a5c437d","id":"MAIR-Lab-HUST/SciIR-82k","author":"MAIR-Lab-HUST","disabled":false,"gated":false,"lastModified":"2026-06-26T10:14:40.000Z","likes":3,"trendingScore":3,"private":false,"sha":"2484fa2118fa3b3edecc82dfabf7271d142c1ede","description":"\n\t\n\t\t\n\t\n\t\n\t\tSciIR-82k Dataset\n\t\n\nSciIR-82k is a large-scale dataset for Scientific Image Reasoning Generation. It is designed to support the training and evaluation of text-to-image models that need to generate scientifically faithful visual content, rather than merely visually plausible illustrations.\nThe dataset contains more than 80,000 high-quality scientific image-text pairs. 
Each sample is derived from open-access scientific publications and enriched with structured reasoning annotations… See the full description on the dataset page: https://huggingface.co/datasets/MAIR-Lab-HUST/SciIR-82k.","downloads":260,"tags":["modality:text","region:us"],"createdAt":"2026-06-24T10:49:03.000Z","key":""},{"_id":"6a3bb9cd6c72e0bfc6e1e3dc","id":"hotdogs/uka-glm-5.2","author":"hotdogs","disabled":false,"gated":false,"lastModified":"2026-06-24T11:41:13.000Z","likes":3,"trendingScore":3,"private":false,"sha":"e2c4803d916fdc86d126361112cc391808168550","description":"\n\n\n\t\n\t\t\n\t\n\t\n\t\t🏆 uka GLM-5.2 Reasoning\n\t\n\nReasoning trace dataset for QLoRA fine-tuning of coding agents\n\n\n\n\n\n\n\n\n\t\n\t\t\n\t\n\t\n\t\t📋 Overview\n\t\n\nuka GLM-5.2 Reasoning is a curated reasoning trace dataset built from GLM-5.2 agent sessions, designed for QLoRA fine-tuning of coding agents.\n\n\t\n\t\t\n\t\n\t\n\t\tWhy Is It Easy to Use?\n\t\n\n\n\t\n\t\t\nFeature\nDescription\n\n\n\t\t\n🎯 Ready to Train\nChatML format — works directly with HuggingFace SFTTrainer, no conversion needed\n\n\n📦 Multiple Formats\nBoth JSONL (readable) and… See the full description on the dataset page: 
https://huggingface.co/datasets/hotdogs/uka-glm-5.2.","downloads":198,"tags":["task_categories:text-generation","task_ids:language-modeling","annotations_creators:machine-generated","language_creators:machine-generated","multilinguality:monolingual","source_datasets:DavidrPatton/Fable-5-GLM-5.2-Traces","source_datasets:AletheiaResearch/GLM-5.2-Agent","language:en","language:th","license:mit","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","reasoning-traces","chain-of-thought","glm-5.2","coding-agent","tool-use","distillation","sft","lora","agent-traces","synthetic-data","deduplicated","data-curation"],"createdAt":"2026-06-24T11:04:45.000Z","key":""},{"_id":"6a3be61d465b01e14bffcc34","id":"lightcone02/OmniContact-Dataset","author":"lightcone02","disabled":false,"gated":"auto","lastModified":"2026-06-29T15:02:45.000Z","likes":3,"trendingScore":3,"private":false,"sha":"db1526eb3ef16ddb461c8347bb226f4c6983b6b4","description":"\n\t\n\t\t\n\t\n\t\n\t\tOmniContact Dataset: Contact-Rich Humanoid Object Interaction\n\t\n\nProject Page\nThis dataset contains human-object interaction motion capture data and processed G1 humanoid trajectories for contact-rich box manipulation and soccer-style interactions. 
The main entry point is the provided visualizer, which lets users inspect the processed NPZ trajectory, original BVH motion, object pose, and contact labels side by side.\n\n  \n    \n      Carry box\n      \n        \n      \n      BVH source… See the full description on the dataset page: https://huggingface.co/datasets/lightcone02/OmniContact-Dataset.","downloads":41,"tags":["task_categories:robotics","task_categories:reinforcement-learning","language:en","license:cc-by-4.0","size_categories:10K<n<100K","modality:3d","arxiv:2606.26201","region:us","robotics","motion-capture","humanoid","human-object-interaction","contact-rich-manipulation","robot-learning","box","soccer"],"createdAt":"2026-06-24T14:13:49.000Z","key":""},{"_id":"6a3c104d17efd5eb8bd57585","id":"PleIAs/Telco-Common-Corpus","author":"PleIAs","disabled":false,"gated":false,"lastModified":"2026-06-25T09:29:43.000Z","likes":3,"trendingScore":3,"private":false,"sha":"7e8990d8bcd2e451371b17a5cc4329b4b051cf19","description":" Telco Common Corpus (TCC) is a ten billion tokens collection of fully open, free licensed telecommunications knowledge (scientific literature, patents, open data, and open-web projects) with licence and provenance verified at a document-level. \nTCC stems from GSMA's effort to make AI work for the telecom sector. 
The Open-Telco LLM Benchmarks and the broader Open Telco AI initiative have already established that current models fall short on real telecom tasks, including network management and… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/Telco-Common-Corpus.","downloads":336,"tags":["language:en","size_categories:1M<n<10M","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-06-24T17:13:49.000Z","key":""},{"_id":"621ffdd236468d709f181ea2","id":"coastalcph/lex_glue","author":"coastalcph","disabled":false,"gated":false,"lastModified":"2024-01-04T14:25:27.000Z","likes":77,"trendingScore":2,"private":false,"sha":"c23fdff1a6bf74e0e1a71cb86f1e781d37da888c","description":"\n\t\n\t\t\n\t\tDataset Card for \"LexGLUE\"\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nInspired by the recent widespread use of the GLUE multi-task benchmark NLP dataset (Wang et al., 2018), the subsequent more difficult SuperGLUE (Wang et al., 2019), other previous multi-task NLP benchmarks (Conneau and Kiela, 2018; McCann et al., 2018), and similar initiatives in other domains (Peng et al., 2019), we introduce the Legal General Language Understanding Evaluation (LexGLUE) benchmark, a benchmark dataset to evaluate… See the full description on the dataset page: 
https://huggingface.co/datasets/coastalcph/lex_glue.","downloads":44264,"tags":["task_categories:question-answering","task_categories:text-classification","task_ids:multi-class-classification","task_ids:multi-label-classification","task_ids:multiple-choice-qa","task_ids:topic-classification","annotations_creators:found","language_creators:found","multilinguality:monolingual","source_datasets:extended","language:en","license:cc-by-4.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2110.00976","arxiv:2109.00904","arxiv:1805.01217","arxiv:2104.08671","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f181ec6","id":"ylecun/mnist","author":"ylecun","disabled":false,"gated":false,"lastModified":"2024-08-08T06:07:00.000Z","likes":252,"trendingScore":2,"private":false,"sha":"77f3279092a1c1579b2250db8eafed0ad422088c","description":"\n\t\n\t\t\n\t\tDataset Card for MNIST\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThe MNIST dataset consists of 70,000 28x28 black-and-white images of handwritten digits extracted from two NIST databases. 
There are 60,000 images in the training dataset and 10,000 images in the validation dataset, one class per digit so a total of 10 classes, with 7,000 images (6,000 train images and 1,000 test images) per class.\nHalf of the image were drawn by Census Bureau employees and the other half by high school students… See the full description on the dataset page: https://huggingface.co/datasets/ylecun/mnist.","downloads":133915,"paperswithcode_id":"mnist","tags":["task_categories:image-classification","task_ids:multi-class-image-classification","annotations_creators:expert-generated","language_creators:found","multilinguality:monolingual","source_datasets:extended|other-nist","language:en","license:mit","size_categories:10K<n<100K","format:parquet","modality:image","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f181f06","id":"openai/openai_humaneval","author":"openai","disabled":false,"gated":false,"lastModified":"2024-01-04T16:08:05.000Z","likes":396,"trendingScore":2,"private":false,"sha":"7dce6050a7d6d172f3cc5c32aa97f52fa1a2e544","description":"\n\t\n\t\t\n\t\tDataset Card for OpenAI HumanEval\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThe HumanEval dataset released by OpenAI includes 164 programming problems with a function sig- nature, docstring, body, and several unit tests. 
They were handwritten to ensure not to be included in the training set of code generation models.\n\n\t\n\t\t\n\t\tSupported Tasks and Leaderboards\n\t\n\n\n\t\n\t\t\n\t\tLanguages\n\t\n\nThe programming problems are written in Python and contain English natural text in comments and docstrings.… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.","downloads":233529,"paperswithcode_id":"humaneval","tags":["annotations_creators:expert-generated","language_creators:expert-generated","multilinguality:monolingual","source_datasets:original","language:en","license:mit","size_categories:n<1K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2107.03374","region:us","code-generation"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f182008","id":"unimelb-nlp/wikiann","author":"unimelb-nlp","disabled":false,"gated":false,"lastModified":"2024-02-22T14:32:02.000Z","likes":123,"trendingScore":2,"private":false,"sha":"f0a3be6dc5564c0cc4150bb660144800a1f539d4","description":"\n\t\n\t\t\n\t\tDataset Card for WikiANN\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nWikiANN (sometimes called PAN-X) is a multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format. This version corresponds to the balanced train, dev, and test splits of Rahimi et al. 
(2019), which supports 176 of the 282 languages from the original WikiANN corpus.\n\n\t\n\t\t\n\t\tSupported Tasks and Leaderboards… See the full description on the dataset page: https://huggingface.co/datasets/unimelb-nlp/wikiann.","downloads":15993,"paperswithcode_id":"wikiann-1","tags":["task_categories:token-classification","task_ids:named-entity-recognition","annotations_creators:machine-generated","language_creators:crowdsourced","multilinguality:multilingual","source_datasets:original","language:ace","language:af","language:als","language:am","language:an","language:ang","language:ar","language:arc","language:arz","language:as","language:ast","language:ay","language:az","language:ba","language:bar","language:be","language:bg","language:bh","language:bn","language:bo","language:br","language:bs","language:ca","language:cbk","language:cdo","language:ce","language:ceb","language:ckb","language:co","language:crh","language:cs","language:csb","language:cv","language:cy","language:da","language:de","language:diq","language:dv","language:el","language:eml","language:en","language:eo","language:es","language:et","language:eu","language:ext","language:fa","language:fi","language:fo","language:fr","language:frr","language:fur","language:fy","language:ga","language:gan","language:gd","language:gl","language:gn","language:gu","language:hak","language:he","language:hi","language:hr","language:hsb","language:hu","language:hy","language:ia","language:id","language:ig","language:ilo","language:io","language:is","language:it","language:ja","language:jbo","language:jv","language:ka","language:kk","language:km","language:kn","language:ko","language:ksh","language:ku","language:ky","language:la","language:lb","language:li","language:lij","language:lmo","language:ln","language:lt","language:lv","language:lzh","language:mg","language:mhr","language:mi","language:min","language:mk","language:ml","language:mn","language:mr","language:ms","language:mt","language:mwl","language:my","l
anguage:mzn","language:nan","language:nap","language:nds","language:ne","language:nl","language:nn","language:no","language:nov","language:oc","language:or","language:os","language:pa","language:pdc","language:pl","language:pms","language:pnb","language:ps","language:pt","language:qu","language:rm","language:ro","language:ru","language:rw","language:sa","language:sah","language:scn","language:sco","language:sd","language:sgs","language:sh","language:si","language:sk","language:sl","language:so","language:sq","language:sr","language:su","language:sv","language:sw","language:szl","language:ta","language:te","language:tg","language:th","language:tk","language:tl","language:tr","language:tt","language:ug","language:uk","language:ur","language:uz","language:vec","language:vep","language:vi","language:vls","language:vo","language:vro","language:wa","language:war","language:wuu","language:xmf","language:yi","language:yo","language:yue","language:zea","language:zh","license:unknown","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:1902.00193","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f182749","id":"SetFit/go_emotions","author":"SetFit","disabled":false,"gated":false,"lastModified":"2022-09-08T15:41:33.000Z","likes":11,"trendingScore":2,"private":false,"sha":"b83b20869f3b5eff89c0a375d8d015a878826094","description":"\n\t\n\t\t\n\t\tGoEmotions\n\t\n\nThis dataset is a port of the official go_emotions dataset on the Hub. 
It only contains the simplified subset as these are the only fields we need for text classification.\n","downloads":404,"tags":["size_categories:10K<n<100K","format:json","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f182a80","id":"allenai/c4","author":"allenai","disabled":false,"gated":false,"lastModified":"2024-01-09T19:14:03.000Z","likes":602,"trendingScore":2,"private":false,"sha":"1588ec454efa1a09f29cd18ddd04fe05fc8653a2","description":"\n\t\n\t\t\n\t\tC4\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nA colossal, cleaned version of Common Crawl's web crawl corpus. Based on Common Crawl dataset: \"https://commoncrawl.org\".\nThis is the processed version of Google's C4 dataset\nWe prepared five variants of the data: en, en.noclean, en.noblocklist, realnewslike, and multilingual (mC4).\nFor reference, these are the sizes of the variants:\n\nen: 305GB\nen.noclean: 2.3TB\nen.noblocklist: 380GB\nrealnewslike: 15GB\nmultilingual (mC4): 9.7TB (108 subsets, one per… See the full description on the dataset page: 
https://huggingface.co/datasets/allenai/c4.","downloads":1084041,"paperswithcode_id":"c4","tags":["task_categories:text-generation","task_categories:fill-mask","task_ids:language-modeling","task_ids:masked-language-modeling","annotations_creators:no-annotation","language_creators:found","multilinguality:multilingual","source_datasets:original","language:af","language:am","language:ar","language:az","language:be","language:bg","language:bn","language:ca","language:ceb","language:co","language:cs","language:cy","language:da","language:de","language:el","language:en","language:eo","language:es","language:et","language:eu","language:fa","language:fi","language:fil","language:fr","language:fy","language:ga","language:gd","language:gl","language:gu","language:ha","language:haw","language:he","language:hi","language:hmn","language:ht","language:hu","language:hy","language:id","language:ig","language:is","language:it","language:iw","language:ja","language:jv","language:ka","language:kk","language:km","language:kn","language:ko","language:ku","language:ky","language:la","language:lb","language:lo","language:lt","language:lv","language:mg","language:mi","language:mk","language:ml","language:mn","language:mr","language:ms","language:mt","language:my","language:ne","language:nl","language:no","language:ny","language:pa","language:pl","language:ps","language:pt","language:ro","language:ru","language:sd","language:si","language:sk","language:sl","language:sm","language:sn","language:so","language:sq","language:sr","language:st","language:su","language:sv","language:sw","language:ta","language:te","language:tg","language:th","language:tr","language:uk","language:und","language:ur","language:uz","language:vi","language:xh","language:yi","language:yo","language:zh","language:zu","license:odc-by","size_categories:10B<n<100B","modality:text","arxiv:1910.10683","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"624bf3c5d84cc6ac39070753","id":"PolyAI/minds14","author"
:"PolyAI","disabled":false,"gated":false,"lastModified":"2025-08-12T09:22:26.000Z","likes":105,"trendingScore":2,"private":false,"sha":"40ce77cb32a384e4d50a568e1ec39ac804019d33","description":"\n\t\n\t\t\n\t\tMInDS-14\n\t\n\nMINDS-14 is training and evaluation resource for intent detection task with spoken data. It covers 14 \nintents extracted from a commercial system in the e-banking domain, associated with spoken examples in 14 diverse language varieties.\n\n\t\n\t\t\n\t\tExample\n\t\n\nMInDS-14 can be downloaded and used as follows:\nfrom datasets import load_dataset\n\nminds_14 = load_dataset(\"PolyAI/minds14\", \"fr-FR\") # for French\n# to download all data for multi-lingual fine-tuning uncomment following… See the full description on the dataset page: https://huggingface.co/datasets/PolyAI/minds14.","downloads":8548,"tags":["task_categories:automatic-speech-recognition","task_ids:keyword-spotting","annotations_creators:expert-generated","annotations_creators:crowdsourced","annotations_creators:machine-generated","language_creators:crowdsourced","language_creators:expert-generated","multilinguality:multilingual","language:en","language:fr","language:it","language:es","language:pt","language:de","language:nl","language:ru","language:pl","language:cs","language:ko","language:zh","license:cc-by-4.0","size_categories:10K<n<100K","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2104.08524","region:us","speech-recognition"],"createdAt":"2022-04-05T07:46:13.000Z","key":""},{"_id":"62a2090e467d335eba288b8a","id":"speechcolab/gigaspeech","author":"speechcolab","disabled":false,"gated":"auto","lastModified":"2026-02-07T05:59:48.000Z","likes":167,"trendingScore":2,"private":false,"sha":"63c0836b643dc6136a608de041e56b67c12649b3","description":"\n\t\n\t\t\n\t\tDataset Card for Gigaspeech\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\nGigaSpeech is an evolving, multi-domain English speech 
recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training. The transcribed audio data is collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc.\n\n\t\n\t\t\n\t\tExample Usage\n\t\n\nThe training split has several configurations of… See the full description on the dataset page: https://huggingface.co/datasets/speechcolab/gigaspeech.","downloads":28930,"tags":["task_categories:automatic-speech-recognition","task_categories:text-to-speech","task_categories:text-to-audio","multilinguality:monolingual","language:en","license:apache-2.0","size_categories:10M<n<100M","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2106.06909","doi:10.57967/hf/6261","region:us"],"createdAt":"2022-06-09T14:51:58.000Z","key":""},{"_id":"62fd6560cad078c7972fb1fd","id":"edinburghcstr/ami","author":"edinburghcstr","disabled":false,"gated":false,"lastModified":"2026-01-13T17:18:38.000Z","likes":91,"trendingScore":2,"private":false,"sha":"46f28f2503e2ec48f8867a84eef356c70476beab","description":"\n\t\n\t\t\n\t\tDataset Card for AMI\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\nThe AMI Meeting Corpus consists of 100 hours of meeting recordings. The recordings use a range of signals\nsynchronized to a common timeline. These include close-talking and far-field microphones, individual and\nroom-view video cameras, and output from a slide projector and an electronic whiteboard. During the meetings,\nthe participants also have unsynchronized pens available to them that record what is written. 
The meetings\nwere… See the full description on the dataset page: https://huggingface.co/datasets/edinburghcstr/ami.","downloads":7795,"tags":["task_categories:automatic-speech-recognition","multilinguality:monolingual","language:en","license:cc-by-4.0","size_categories:100K<n<1M","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:1906.11047","region:us"],"createdAt":"2022-08-17T22:02:08.000Z","key":""},{"_id":"63079b43a642348dc5e4f730","id":"kakaobrain/coyo-700m","author":"kakaobrain","disabled":false,"gated":false,"lastModified":"2022-08-30T19:07:52.000Z","likes":162,"trendingScore":2,"private":false,"sha":"54ee2d8c64d3d80a5e10ef6952a4466551834fc1","description":"\n\t\n\t\t\n\t\tDataset Card for COYO-700M\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nCOYO-700M is a large-scale dataset that contains 747M image-text pairs as well as many other meta-attributes to increase the usability to train various models. Our dataset follows a similar strategy to previous vision-and-language datasets, collecting many informative pairs of alt-text and its associated image in HTML documents. 
We expect COYO to be used to train popular large-scale foundation models \ncomplementary to other… See the full description on the dataset page: https://huggingface.co/datasets/kakaobrain/coyo-700m.","downloads":2731,"tags":["task_categories:text-to-image","task_categories:image-to-text","task_categories:zero-shot-classification","task_ids:image-captioning","annotations_creators:no-annotation","language_creators:other","multilinguality:monolingual","source_datasets:original","language:en","license:cc-by-4.0","size_categories:100M<n<1B","format:parquet","modality:image","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2102.05918","arxiv:2204.06125","arxiv:2010.11929","region:us","image-text pairs"],"createdAt":"2022-08-25T15:54:43.000Z","key":""},{"_id":"633a585e593f7e38374056ec","id":"bigcode/the-stack","author":"bigcode","disabled":false,"gated":"auto","lastModified":"2023-04-13T12:15:50.000Z","likes":1026,"trendingScore":2,"private":false,"sha":"349a71353fd5868fb90b593ef09e311379da498a","description":"\n\t\n\t\t\n\t\tDataset Card for The Stack\n\t\n\n\n\n\t\n\t\t\n\t\tChangelog\n\t\n\n\n\t\n\t\t\nRelease\nDescription\n\n\n\t\t\nv1.0\nInitial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: Three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 3TB in size.\n\n\nv1.1\nThe three copyleft licenses ((MPL/EPL/LGPL) were excluded and the list of permissive licenses extended to 193 licenses in total. 
The list of programming languages… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack.","downloads":23127,"tags":["task_categories:text-generation","language_creators:crowdsourced","language_creators:expert-generated","multilinguality:multilingual","language:code","license:other","size_categories:100M<n<1B","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2211.15533","arxiv:2107.03374","arxiv:2207.14157","region:us"],"createdAt":"2022-10-03T03:34:54.000Z","key":""},{"_id":"639244f571c51c43091df168","id":"Anthropic/hh-rlhf","author":"Anthropic","disabled":false,"gated":false,"lastModified":"2023-05-26T18:47:34.000Z","likes":1800,"trendingScore":2,"private":false,"sha":"09be8c5bbc57cb3887f3a9732ad6aa7ec602a1fa","description":"\n\t\n\t\t\n\t\tDataset Card for HH-RLHF\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThis repository provides access to two different kinds of data:\n\nHuman preference data about helpfulness and harmlessness from Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. These data are meant to train preference (or reward) models for subsequent RLHF training. These data are not meant for supervised training of dialogue agents. 
Training dialogue agents on these data is likely to lead… See the full description on the dataset page: https://huggingface.co/datasets/Anthropic/hh-rlhf.","downloads":28394,"tags":["license:mit","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2204.05862","region:us","human-feedback"],"createdAt":"2022-12-08T20:11:33.000Z","key":""},{"_id":"63d02cc8da1e1edac8a744f8","id":"GBaker/MedQA-USMLE-4-options","author":"GBaker","disabled":false,"gated":false,"lastModified":"2023-01-24T19:18:09.000Z","likes":97,"trendingScore":2,"private":false,"sha":"0fb93dd23a7339b6dcd27e241cb9b5eca62d4d18","description":"Original dataset introduced by Jin et al. in What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams\nCitation information:\n\n@article{jin2020disease,\n  title={What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams},\n  author={Jin, Di and Pan, Eileen and Oufattole, Nassim and Weng, Wei-Hung and Fang, Hanyi and Szolovits, Peter},\n  journal={arXiv preprint arXiv:2009.13081},\n  year={2020}\n}\n\n","downloads":16695,"tags":["language:en","license:cc-by-4.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2009.13081","region:us"],"createdAt":"2023-01-24T19:08:56.000Z","key":""},{"_id":"63d56d2963c8bec466d31748","id":"qwedsacf/competition_math","author":"qwedsacf","disabled":false,"gated":false,"lastModified":"2023-01-28T20:28:01.000Z","likes":134,"trendingScore":2,"private":false,"sha":"e839825f9ec5c6cfa585c654a59610969ec13993","description":"\n\t\n\t\t\n\t\tDataset Card for Mathematics Aptitude Test of Heuristics (MATH) dataset\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThe Mathematics Aptitude Test of Heuristics (MATH) dataset consists of problems\nfrom mathematics competitions, 
including the AMC 10, AMC 12, AIME, and more. \nEach problem in MATH has a full step-by-step solution, which can be used to teach\nmodels to generate answer derivations and explanations.\n\n\t\n\t\t\n\t\tSupported Tasks and Leaderboards\n\t\n\n[More Information Needed]\n\n\t\n\t\t\n\t\tLanguages… See the full description on the dataset page: https://huggingface.co/datasets/qwedsacf/competition_math.","downloads":8642,"tags":["annotations_creators:expert-generated","language_creators:expert-generated","multilinguality:monolingual","source_datasets:original","language:en","license:mit","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2103.03874","region:us","explanation-generation"],"createdAt":"2023-01-28T18:44:57.000Z","key":""},{"_id":"6425ce0937a416bff53ce5a7","id":"mattmdjaga/human_parsing_dataset","author":"mattmdjaga","disabled":false,"gated":false,"lastModified":"2024-03-03T14:17:41.000Z","likes":50,"trendingScore":2,"private":false,"sha":"db120bb5c18c146a8fbd2160f7575a288269fe7d","description":"\n\t\n\t\t\n\t\tDataset Card for Human parsing data (ATR)\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThis dataset has 17,706 images and mask pairs. It is just a copy of \nDeep Human Parsing ATR dataset. 
The mask labels are: \n \"0\": \"Background\",\n    \"1\": \"Hat\",\n    \"2\": \"Hair\",\n    \"3\": \"Sunglasses\",\n    \"4\": \"Upper-clothes\",\n    \"5\": \"Skirt\",\n    \"6\": \"Pants\",\n    \"7\": \"Dress\",\n    \"8\": \"Belt\",\n    \"9\": \"Left-shoe\",\n    \"10\": \"Right-shoe\",\n    \"11\": \"Face\",\n    \"12\": \"Left-leg\",\n    \"13\": \"Right-leg\",\n    \"14\":… See the full description on the dataset page: https://huggingface.co/datasets/mattmdjaga/human_parsing_dataset.","downloads":444,"tags":["task_categories:image-segmentation","task_ids:semantic-segmentation","size_categories:10K<n<100K","format:parquet","modality:image","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-03-30T17:59:37.000Z","key":""},{"_id":"644007b04164a65ca127b34b","id":"latentcat/animesfw","author":"latentcat","disabled":false,"gated":false,"lastModified":"2023-04-24T14:10:44.000Z","likes":44,"trendingScore":2,"private":false,"sha":"456ffaf26a0a571db20c60566dd8ee8336a73177","description":"\n\t\n\t\t\n\t\tDataset Card for \"animesfw\"\n\t\n\nMore Information needed\n","downloads":1438,"tags":["size_categories:1M<n<10M","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-04-19T15:24:32.000Z","key":""},{"_id":"648b556b363cf923caddc497","id":"Open-Orca/OpenOrca","author":"Open-Orca","disabled":false,"gated":false,"lastModified":"2025-02-19T07:32:36.000Z","likes":1555,"trendingScore":2,"private":false,"sha":"e9c87b4abb2609913751f9b26553fdb9c061796c","description":"🐋 The OpenOrca Dataset! 
🐋\n\n\n\nWe are thrilled to announce the release of the OpenOrca dataset!\nThis rich collection of augmented FLAN data aligns, as best as possible, with the distributions outlined in the Orca paper.\nIt has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for all NLP researchers and developers!\n\n\t\n\t\t\n\t\n\t\n\t\tOfficial Models\n\t\n\n\n\t\n\t\n\t\n\t\tMistral-7B-OpenOrca\n\t\n\nOur latest model, the first 7B to score better overall than all… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/OpenOrca.","downloads":17330,"tags":["task_categories:text-classification","task_categories:token-classification","task_categories:table-question-answering","task_categories:question-answering","task_categories:zero-shot-classification","task_categories:summarization","task_categories:feature-extraction","task_categories:text-generation","language:en","license:mit","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2306.02707","arxiv:2301.13688","arxiv:2302.13971","region:us"],"createdAt":"2023-06-15T18:16:11.000Z","key":""},{"_id":"649f37af37bfb5202beabdf4","id":"allenai/dolma","author":"allenai","disabled":false,"gated":false,"lastModified":"2024-04-17T02:57:00.000Z","likes":1048,"trendingScore":2,"private":false,"sha":"7f48140530a023e9ea4c5cfb141160922727d4d3","citation":"@article{dolma,\n  title = {{Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research}},\n  author = {\n    Luca Soldaini and Rodney Kinney and Akshita Bhagia and Dustin Schwenk and David Atkinson and\n    Russell Authur and Ben Bogin and Khyathi Chandu and Jennifer Dumas and Yanai Elazar and\n    Valentin Hofmann and Ananya Harsh Jha and Sachin Kumar and Li Lucy and Xinxi Lyu and Ian Magnusson and\n    Jacob Morrison and Niklas Muennighoff and Aakanksha Naik and Crystal Nam and Matthew E. 
Peters and\n    Abhilasha Ravichander and Kyle Richardson and Zejiang Shen and Emma Strubell and Nishant Subramani and\n    Oyvind Tafjord and Evan Pete Walsh and Hannaneh Hajishirzi and Noah A. Smith and Luke Zettlemoyer and\n    Iz Beltagy and Dirk Groeneveld and Jesse Dodge and Kyle Lo\n},\n  year = {2024},\n  journal={arXiv preprint},\n}","description":"Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research","downloads":4284,"tags":["task_categories:text-generation","language:en","license:odc-by","size_categories:n>1T","arxiv:2402.00159","arxiv:2301.13688","region:us","language-modeling","casual-lm","llm"],"createdAt":"2023-06-30T20:14:39.000Z","key":""},{"_id":"64b67e1341d9fa8f906cfac4","id":"lmsys/chatbot_arena_conversations","author":"lmsys","disabled":false,"gated":"auto","lastModified":"2023-09-30T01:04:44.000Z","likes":465,"trendingScore":2,"private":false,"sha":"1b6335d42a1d2c7e34870c905d03ab964f7f2bd8","description":"\n\t\n\t\t\n\t\tChatbot Arena Conversations Dataset\n\t\n\nThis dataset contains 33K cleaned conversations with pairwise human preferences.\nIt is collected from 13K unique IP addresses on the Chatbot Arena from April to June 2023.\nEach sample includes a question ID, two model names, their full conversation text in OpenAI API JSON format, the user vote, the anonymized user ID, the detected language tag, the OpenAI moderation API tag, the additional toxic tag, and the timestamp.\nTo ensure the safe release… See the full description on the dataset page: 
https://huggingface.co/datasets/lmsys/chatbot_arena_conversations.","downloads":9788,"tags":["license:cc","size_categories:10K<n<100K","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2306.05685","region:us"],"createdAt":"2023-07-18T11:57:07.000Z","key":""},{"_id":"64dbd28f00b80a024c762bd8","id":"glaiveai/glaive-function-calling-v2","author":"glaiveai","disabled":false,"gated":false,"lastModified":"2023-09-27T18:04:08.000Z","likes":515,"trendingScore":2,"private":false,"sha":"e7f4b6456019f5d8bcb991ef0dd67d8ff23221ac","downloads":40986,"tags":["task_categories:text-generation","language:en","license:apache-2.0","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-08-15T19:31:27.000Z","key":""},{"_id":"64f5336eecce6dc5895aaacd","id":"teknium/openhermes","author":"teknium","disabled":false,"gated":false,"lastModified":"2023-09-07T20:41:05.000Z","likes":224,"trendingScore":2,"private":false,"sha":"cbba06aea4dd066d19dc0985a930c13f803d3db3","description":"\n\t\n\t\t\n\t\tOpenHermes Dataset\n\t\n\n\nThe OpenHermes dataset is composed of 242,000 entries of primarily GPT-4 generated data, from open datasets across the AI landscape, including:\nOpenHermes 13B is the first fine tune of the Hermes dataset that has a fully open source dataset!\nOpenHermes was trained on 242,000 entries of primarily GPT-4 generated data, from open datasets across the AI landscape, including:\n\nGPTeacher - General Instruct, Roleplay v1, Roleplay v2, and Code Instruct Datasets, by… See the full description on the dataset page: 
https://huggingface.co/datasets/teknium/openhermes.","downloads":1067,"tags":["task_categories:text-generation","language:eng","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","distillation","synthetic data","gpt"],"createdAt":"2023-09-04T01:31:26.000Z","key":""},{"_id":"64f7c6e8baa3b4ec4e37b1d8","id":"open-web-math/open-web-math","author":"open-web-math","disabled":false,"gated":false,"lastModified":"2023-10-17T20:14:00.000Z","likes":350,"trendingScore":2,"private":false,"sha":"fde8ef8de2300f5e778f56261843dab89f230815","description":"\n\nKeiran Paster*, Marco Dos Santos*, Zhangir Azerbayev, Jimmy Ba\nGitHub  | ArXiv\n| PDF\nOpenWebMath is a dataset containing the majority of the high-quality, mathematical text from the internet. It is filtered and extracted from over 200B HTML files on Common Crawl down to a set of 6.3 million documents containing a total of 14.7B tokens. OpenWebMath is intended for use in pretraining and finetuninglarge language models.\nYou can download the dataset using Hugging Face:\nfrom datasets import… See the full description on the dataset page: https://huggingface.co/datasets/open-web-math/open-web-math.","downloads":33233,"tags":["size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2310.06786","region:us"],"createdAt":"2023-09-06T00:25:12.000Z","key":""},{"_id":"64ff224ee7ef4e223d949733","id":"TIGER-Lab/MathInstruct","author":"TIGER-Lab","disabled":false,"gated":false,"lastModified":"2024-05-15T00:06:46.000Z","likes":305,"trendingScore":2,"private":false,"sha":"b4fdc323a7be1379c9c7c0b67b1de72dfee2111a","description":"\n\t\n\t\t\n\t\t🦣 MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning\n\t\n\nMathInstruct is a meticulously curated instruction tuning dataset that is lightweight yet generalizable. 
MathInstruct is compiled from 13 math rationale datasets, six of which are newly curated by this work. It uniquely focuses on the hybrid use of chain-of-thought (CoT) and program-of-thought (PoT) rationales, and ensures extensive coverage of diverse mathematical fields. \nProject Page:… See the full description on the dataset page: https://huggingface.co/datasets/TIGER-Lab/MathInstruct.","downloads":7145,"tags":["task_categories:text-generation","language:en","license:mit","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2309.05653","region:us","math"],"createdAt":"2023-09-11T14:21:02.000Z","key":""},{"_id":"651afb6f2590c4c624511419","id":"neural-bridge/rag-dataset-12000","author":"neural-bridge","disabled":false,"gated":false,"lastModified":"2024-02-05T18:25:13.000Z","likes":160,"trendingScore":2,"private":false,"sha":"586c0d0ddeb022fccd909c7b415cc2ca8660baa4","description":"\n\t\n\t\t\n\t\tRetrieval-Augmented Generation (RAG) Dataset 12000\n\t\n\nRetrieval-Augmented Generation (RAG) Dataset 12000 is an English dataset designed for RAG-optimized models, built by Neural Bridge AI, and released under Apache license 2.0.\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nRetrieval-Augmented Generation (RAG) enhances large language models (LLMs) by allowing them to consult an external authoritative knowledge base before generating responses. 
This approach significantly… See the full description on the dataset page: https://huggingface.co/datasets/neural-bridge/rag-dataset-12000.","downloads":1068,"tags":["task_categories:question-answering","language:en","license:apache-2.0","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","retrieval-augmented-generation"],"createdAt":"2023-10-02T17:18:39.000Z","key":""},{"_id":"652739b50ef49cfb783ee99d","id":"EleutherAI/proof-pile-2","author":"EleutherAI","disabled":false,"gated":false,"lastModified":"2023-10-25T06:16:04.000Z","likes":227,"trendingScore":2,"private":false,"sha":"901a9273a770e9d4138c5ddd91802f9c5c6cdc4b","description":"A dataset of high quality mathematical text.","downloads":16377,"tags":["task_categories:text-generation","language:en","size_categories:10B<n<100B","arxiv:2310.10631","arxiv:2310.06786","region:us","math"],"createdAt":"2023-10-12T00:11:33.000Z","key":""},{"_id":"652cadad1a3250bbfe7b3834","id":"shengqin/web-attacks-ab2","author":"shengqin","disabled":false,"gated":false,"lastModified":"2023-10-16T03:31:52.000Z","likes":6,"trendingScore":2,"private":false,"sha":"9f79875449c5aa2c4e887a827d67af38b551e74e","downloads":55,"tags":["size_categories:10K<n<100K","format:csv","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-10-16T03:27:41.000Z","key":""},{"_id":"652d9c6f84595aeba1938350","id":"wlee44/attack","author":"wlee44","disabled":false,"gated":false,"lastModified":"2023-10-16T20:27:11.000Z","likes":2,"trendingScore":2,"private":false,"sha":"fd0c9628d2d2f56875351a342ff632d260c62ad4","downloads":24,"tags":["size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-10-16T20:26:23.000Z","key":""},{"_id":"6531c49c9222842fbd101a8a","id":"sebastiandizon/genius-s
ong-lyrics","author":"sebastiandizon","disabled":false,"gated":false,"lastModified":"2023-10-20T00:09:37.000Z","likes":34,"trendingScore":2,"private":false,"sha":"065f07cbdbbce5f5b13d9a39dd872c77ad757474","downloads":34172,"tags":["size_categories:1M<n<10M","format:csv","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-10-20T00:06:52.000Z","key":""},{"_id":"65377f5989dd48faca8f7cf1","id":"HuggingFaceH4/ultrachat_200k","author":"HuggingFaceH4","disabled":false,"gated":false,"lastModified":"2024-10-16T11:52:27.000Z","likes":737,"trendingScore":2,"private":false,"sha":"8049631c405ae6576f93f445c6b8166f76f5505a","description":"\n\t\n\t\t\n\t\tDataset Card for UltraChat 200k\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\nThis is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state of the art 7b chat model.\nThe original datasets consists of 1.4M dialogues generated by ChatGPT and spanning a wide range of topics. 
To create UltraChat 200k, we applied the following logic:\n\nSelection of a subset of data for faster supervised fine tuning.\nTruecasing of the dataset, as we observed around 5% of the data… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k.","downloads":57366,"tags":["task_categories:text-generation","language:en","license:mit","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2305.14233","region:us"],"createdAt":"2023-10-24T08:24:57.000Z","key":""},{"_id":"653c4d807bd6a97439da6505","id":"allenai/WildChat","author":"allenai","disabled":false,"gated":false,"lastModified":"2025-08-11T20:35:20.000Z","likes":195,"trendingScore":2,"private":false,"sha":"f66566ceaaeb619dd98ffb0f3bf3ce1f86775ac4","description":"\n\t\n\t\t\n\t\tDataset Card for WildChat\n\t\n\n\n\t\n\t\t\n\t\tNote: a newer version with 4.8 million conversations and demographic information can be found here.\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\n\nPaper: https://arxiv.org/abs/2405.01470\n\nInteractive Search Tool: https://wildvisualizer.com (paper)\n\nLicense: ODC-BY\n\nLanguage(s) (NLP): multi-lingual\n\nPoint of Contact: Yuntian Deng\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Summary\n\t\n\nWildChat is a collection of 650K conversations between human users and ChatGPT. 
We collected WildChat… See the full description on the dataset page: https://huggingface.co/datasets/allenai/WildChat.","downloads":10546,"tags":["task_categories:text-generation","task_categories:question-answering","license:odc-by","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2405.01470","arxiv:2409.03753","arxiv:2406.13706","region:us","instruction-finetuning"],"createdAt":"2023-10-27T23:53:36.000Z","key":""},{"_id":"654312358ed3d7ec0403cc41","id":"lavita/medical-qa-datasets","author":"lavita","disabled":false,"gated":false,"lastModified":"2023-11-17T20:49:51.000Z","likes":62,"trendingScore":2,"private":false,"sha":"59d48e2739fff1de8803ee59b97547ad51846650","description":"\nall-processed dataset is a concatenation of of medical-meadow-* and chatdoctor_healthcaremagic datasets\nThe Chat Doctor term is replaced by the chatbot term in the chatdoctor_healthcaremagic dataset\nSimilar to the literature the medical_meadow_cord19 dataset is subsampled to 50,000 samples\ntruthful-qa-* is a benchmark dataset for evaluating the truthfulness of models in text generation, which is used in Llama 2 paper. 
Within this dataset, there are 55 and 16 questions related to Health and… See the full description on the dataset page: https://huggingface.co/datasets/lavita/medical-qa-datasets.","downloads":2252,"tags":["task_categories:question-answering","language:en","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","medical","healthcare","clinical"],"createdAt":"2023-11-02T03:06:29.000Z","key":""},{"_id":"655100ea2adb0688a0042ddd","id":"teknium/OpenHermes-2.5","author":"teknium","disabled":false,"gated":false,"lastModified":"2024-04-15T08:18:12.000Z","likes":863,"trendingScore":2,"private":false,"sha":"b82037821055c377bed0d495e72e46de3bc72e84","description":"\n\n\t\n\t\t\n\t\tDataset Card for Dataset Name\n\t\n\nThis is the dataset that made OpenHermes 2.5 and Nous Hermes 2 series of models.\nSupport me on GitHub sponsors <3 : https://github.com/sponsors/teknium1\n\n\t\n\t\t\n\t\tDataset Details\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\nThe Open Hermes 2/2.5 and Nous Hermes 2 models have made significant advancements of SOTA LLM's over recent months, and are underpinned by this exact compilation and curation of many open source datasets and custom created synthetic datasets.… See the full description on the dataset page: https://huggingface.co/datasets/teknium/OpenHermes-2.5.","downloads":15153,"tags":["language:eng","size_categories:1M<n<10M","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","synthetic","GPT-4","Distillation","Compilation"],"createdAt":"2023-11-12T16:44:26.000Z","key":""},{"_id":"65589367ab0644b5315bb80e","id":"isaacus/open-australian-legal-qa","author":"isaacus","disabled":false,"gated":false,"lastModified":"2026-02-16T08:32:52.000Z","likes":23,"trendingScore":2,"private":false,"sha":"a2178afc2cff8af50fe4365c7d6f9cecf91dbe4b","description":"\n\t\n\t\t\n\t\tOpen Australian Legal QA 
‍⚖️\n\t\n\n\nOpen Australian Legal QA by Isaacus is the first open dataset of Australian legal questions and answers.\nComprised of 2,124 questions and answers synthesised by gpt-4 from the Open Australian Legal Corpus, the largest open database of Australian law, the dataset is intended to facilitate the development of legal AI assistants in Australia.\nTo ensure its accessibility to as wide an audience as possible, the dataset is distributed under the same licence… See the full description on the dataset page: https://huggingface.co/datasets/isaacus/open-australian-legal-qa.","downloads":538,"tags":["task_categories:question-answering","task_categories:text-generation","task_ids:closed-domain-qa","annotations_creators:machine-generated","language_creators:machine-generated","source_datasets:isaacus/open-australian-legal-corpus","language:en","license:other","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","doi:10.57967/hf/1479","region:us","law","legal","australia","question-answering","qa","question-answer","text-generation","llm","chatbot","conversational-ai","generative-ai","natural-language-understanding","fine-tuning"],"createdAt":"2023-11-18T10:35:19.000Z","key":""},{"_id":"655fe5a24a5a63bc00aa9f11","id":"danielz01/landmarks","author":"danielz01","disabled":false,"gated":false,"lastModified":"2023-11-23T23:57:47.000Z","likes":4,"trendingScore":2,"private":false,"sha":"c938656d6e667ff3b91f08343563d2765c166c49","description":"\n\t\n\t\t\n\t\tDataset Card for \"landmarks\"\n\t\n\nMore Information 
needed\n","downloads":112,"tags":["size_categories:n<1K","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-11-23T23:52:02.000Z","key":""},{"_id":"658d2f37f893598fcaec7024","id":"GAIR/MathPile","author":"GAIR","disabled":false,"gated":"auto","lastModified":"2025-04-03T10:31:50.000Z","likes":195,"trendingScore":2,"private":false,"sha":"bde88d8c04147bdca0a569839a61067ac41f6bd2","description":"\n\n🔥Update: \n\n[2023/01/06] We release the commercial-use version of MathPile, namely MathPile_Commercial.\n[2023/01/06] We release the new version (v0.2, cleaner version) of MathPile. It has been updated to the main branch (also the v0.2 branch). The main updates are as follows:\nfixed a problem with the display of mathematical formulas in the Wikipedia subset, which was caused by the HTML conversion to markdown;\nfixed unclosed caption parentheses in the image environment in arXiv and macro… See the full description on the dataset page: https://huggingface.co/datasets/GAIR/MathPile.","downloads":255,"tags":["language:en","license:cc-by-nc-sa-4.0","size_categories:1B<n<10B","library:mlcroissant","arxiv:2312.17120","region:us","croissant"],"createdAt":"2023-12-28T08:17:59.000Z","key":""},{"_id":"65a80c2cc5ffe1d019bb40b0","id":"sdsr/erp-and-erotica","author":"sdsr","disabled":false,"gated":false,"lastModified":"2026-03-18T07:40:47.000Z","likes":6,"trendingScore":2,"private":false,"sha":"897e527f3b6fcbfd5b3843d6f02c231a62c30457","description":"Mirror of ERP/RP and erotica raw data collection (Edit: 07 Jan 2025 16:09 UTC)\n(No new data after 04 Jan 2024 12:01 
UTC).\n","downloads":613,"tags":["task_categories:text-generation","language:en","size_categories:10M<n<100M","format:csv","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us"],"createdAt":"2024-01-17T17:19:40.000Z","key":""},{"_id":"65af4645d0a5cc99d51642da","id":"McAuley-Lab/Amazon-Reviews-2023","author":"McAuley-Lab","disabled":false,"gated":false,"lastModified":"2024-12-08T22:21:49.000Z","likes":315,"trendingScore":2,"private":false,"sha":"2b6d039ed471f2ba5fd2acb718bf33b0a7e5598e","description":"Amazon Review 2023 is an updated version of the Amazon Review 2018 dataset.\nThis dataset mainly includes reviews (ratings, text) and item metadata (desc-\nriptions, category information, price, brand, and images). Compared to the pre-\nvious versions, the 2023 version features larger size, newer reviews (up to Sep\n2023), richer and cleaner meta data, and finer-grained timestamps (from day to \nmilli-second).","downloads":119065,"tags":["language:en","size_categories:10B<n<100B","arxiv:2403.03952","region:us","recommendation","reviews"],"createdAt":"2024-01-23T04:53:25.000Z","key":""},{"_id":"65dc4b43102c3044815e3d0d","id":"CausalLM/Refined-Anime-Text","author":"CausalLM","disabled":false,"gated":"manual","lastModified":"2025-02-14T18:30:24.000Z","likes":273,"trendingScore":2,"private":false,"sha":"4ab3819c25bad66fad5fff2269d01e0c833638d0","description":"\n\t\n\t\t\n\t\tRefined Anime Text for Continual Pre-training of Language Models\n\t\n\nThis is a subset of our novel synthetic dataset of anime-themed text, containing over 1M entries, ~440M GPT-4/3.5 tokens. This dataset has never been publicly released before. 
We are releasing this subset due to the community's interest in anime culture, which is underrepresented in general-purpose datasets, and the low quality of raw text due to the prevalence of internet slang and irrelevant content, making it… See the full description on the dataset page: https://huggingface.co/datasets/CausalLM/Refined-Anime-Text.","downloads":16,"tags":["task_categories:text-generation","language:en","language:zh","license:wtfpl","size_categories:1M<n<10M","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","synthetic"],"createdAt":"2024-02-26T08:26:43.000Z","key":""},{"_id":"65de38a84dab079f32475589","id":"gorilla-llm/Berkeley-Function-Calling-Leaderboard","author":"gorilla-llm","disabled":false,"gated":false,"lastModified":"2026-04-29T00:03:02.000Z","likes":113,"trendingScore":2,"private":false,"sha":"61fc0608cfd831fcfbbaa676ebdfef0ed963eeda","description":"\n\t\n\t\t\n\t\tBerkeley Function Calling Leaderboard\n\t\n\nThe Berkeley function calling leaderboard is a live leaderboard to evaluate the ability of different LLMs to call functions (also referred to as tools).\nWe built this dataset from our learnings to be representative of most users' function calling use-cases, for example, in agents, as a part of enterprise workflows, etc.\nTo this end, our evaluation dataset spans diverse categories, and across multiple languages.\nCheckout the Leaderboard at… See the full description on the dataset page: 
https://huggingface.co/datasets/gorilla-llm/Berkeley-Function-Calling-Leaderboard.","downloads":55862,"tags":["language:en","license:apache-2.0","region:us"],"createdAt":"2024-02-27T19:31:52.000Z","key":""},{"_id":"65e6e4934b2e0f45e4eeabfe","id":"sean0042/KorMedMCQA","author":"sean0042","disabled":false,"gated":false,"lastModified":"2024-12-09T06:59:28.000Z","likes":38,"trendingScore":2,"private":false,"sha":"79efd6f91edfc8036330d7a4daa88b9f2deb9a82","description":"\n\t\n\t\t\n\t\tKorMedMCQA : Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations\n\t\n\nWe present KorMedMCQA, the first Korean Medical Multiple-Choice Question\nAnswering benchmark, derived from professional healthcare licensing\nexaminations conducted in Korea between 2012 and 2024. The dataset contains\n7,469 questions from examinations for doctor, nurse, pharmacist, and dentist,\ncovering a wide range of medical disciplines. We evaluate the performance of 59… See the full description on the dataset page: https://huggingface.co/datasets/sean0042/KorMedMCQA.","downloads":3932,"tags":["task_categories:question-answering","language:ko","license:cc-by-nc-2.0","size_categories:1K<n<10K","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2403.01469","region:us","medical"],"createdAt":"2024-03-05T09:23:31.000Z","key":""},{"_id":"65fb8e51ed943e320d2b7232","id":"lukebarousse/data_jobs","author":"lukebarousse","disabled":false,"gated":false,"lastModified":"2025-06-03T16:13:27.000Z","likes":84,"trendingScore":2,"private":false,"sha":"ed776e5a0a8c40ea9d5efbd800772ae52e140f3e","description":"\n\t\n\t\t\n\t\t🧠 data_jobs Dataset\n\t\n\nA dataset of real-world data analytics job postings from 2023, collected and processed by Luke Barousse.\n\n\t\n\t\t\n\t\tBackground\n\t\n\nI've been collecting data on data job postings since 2022. 
I've been using a bot to scrape the data from Google, which come from a variety of sources. \nYou can find the full dataset at my app datanerd.tech.\n\nSerpapi has kindly supported my work by providing me access to their API. Tell them I sent you and get 20% off paid plans.… See the full description on the dataset page: https://huggingface.co/datasets/lukebarousse/data_jobs.","downloads":9781,"tags":["license:apache-2.0","size_categories:100K<n<1M","format:csv","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-03-21T01:33:05.000Z","key":""},{"_id":"65fc5a783bc54054aa2e6e62","id":"gretelai/synthetic_text_to_sql","author":"gretelai","disabled":false,"gated":false,"lastModified":"2025-12-16T19:17:20.000Z","likes":669,"trendingScore":2,"private":false,"sha":"740ab236e64503fba51be1101df7a1be83bf455d","description":"\n  \n  Image generated by DALL-E. See prompt for more details\n\n\n\n\t\n\t\t\n\t\tsynthetic_text_to_sql\n\t\n\n\ngretelai/synthetic_text_to_sql is a rich dataset of high quality synthetic Text-to-SQL samples, \ndesigned and generated using Gretel Navigator, and released under Apache 2.0.\nPlease see our release blogpost for more details.\nThe dataset includes:\n\n  105,851 records partitioned into 100,000 train and 5,851 test records\n  ~23M total tokens, including ~12M SQL tokens\n  Coverage across 100 distinct… See the full description on the dataset page: 
https://huggingface.co/datasets/gretelai/synthetic_text_to_sql.","downloads":3341,"tags":["task_categories:question-answering","task_categories:table-question-answering","task_categories:text-generation","language:en","license:apache-2.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","library:datadesigner","arxiv:2306.05685","region:us","synthetic","SQL","text-to-SQL","code","datadesigner"],"createdAt":"2024-03-21T16:04:08.000Z","key":""},{"_id":"660b1113aa6df80279626223","id":"ProGamerGov/synthetic-dataset-1m-dalle3-high-quality-captions","author":"ProGamerGov","disabled":false,"gated":false,"lastModified":"2024-10-30T15:57:33.000Z","likes":153,"trendingScore":2,"private":false,"sha":"216b8becd8c112e2a87c08f9ce8f2d65247c8c09","description":"\n\t\n\t\t\n\t\tDataset Card for Dalle3 1 Million+ High Quality Captions\n\t\n\nAlt name: Human Preference Synthetic Dataset\n\n\n\nExample grids for landscapes, cats, creatures, and fantasy are also available.\n\n\n\t\n\t\t\n\t\tDescription:\n\t\n\nThis dataset comprises of AI-generated images sourced from various websites and individuals, primarily focusing on Dalle 3 content, along with contributions from other AI systems of sufficient quality like Stable Diffusion and Midjourney (MJ v5 and above). 
As users typically… See the full description on the dataset page: https://huggingface.co/datasets/ProGamerGov/synthetic-dataset-1m-dalle3-high-quality-captions.","downloads":1060,"tags":["task_categories:text-to-image","task_categories:image-classification","task_categories:image-to-text","task_categories:image-text-to-text","task_categories:other","language:en","license:mit","size_categories:1M<n<10M","modality:image","modality:text","region:us","image","text","image-text-dataset","synthetic-dataset","CogVLM","synthetic data","dalle3","dalle-3","DALL·E 3","midjourney","stable diffusion","Llama3"],"createdAt":"2024-04-01T19:54:59.000Z","key":""},{"_id":"660d79c9e8379a6f24fd5288","id":"wofmanaf/ego4d-video","author":"wofmanaf","disabled":false,"gated":false,"lastModified":"2024-04-10T11:18:39.000Z","likes":16,"trendingScore":2,"private":false,"sha":"b0d0895ae6b9461fc8ccf1d0d56edd1f9e3d9042","description":"EgoCOT is a large-scale embodied planning dataset, which selected egocentric videos from the Ego4D dataset and corresponding high-quality step-by-step language instructions, which are machine generated, then semantics-based filtered, and finally human-verified.\nFor mored details, please visit EgoCOT_Dataset.\nIf you find this dataset useful, please consider citing the paper,\n@article{mu2024embodiedgpt,\n  title={Embodiedgpt: Vision-language pre-training via embodied chain of thought}… See the full description on the dataset page: 
https://huggingface.co/datasets/wofmanaf/ego4d-video.","downloads":637,"tags":["task_categories:question-answering","language:en","license:apache-2.0","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-04-03T15:46:17.000Z","key":""},{"_id":"661da64be6e166452da68324","id":"PleIAs/YouTube-Commons","author":"PleIAs","disabled":false,"gated":false,"lastModified":"2024-06-26T08:08:14.000Z","likes":391,"trendingScore":2,"private":false,"sha":"9addbabbfcd7409acbcd11a3b59ec2aef6da7eb0","description":"\n\t\n\t\t\n\t\t📺 YouTube-Commons 📺\n\t\n\nYouTube-Commons is a collection of audio transcripts of 2,063,066 videos shared on YouTube under a CC-By license.\n\n\t\n\t\t\n\t\tContent\n\t\n\nThe collection comprises 22,709,724 original and automatically translated transcripts from 3,156,703 videos (721,136 individual channels).\nIn total, this represents nearly 45 billion words (44,811,518,375).\nAll the videos where shared on YouTube with a CC-BY license: the dataset provide all the necessary provenance information… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/YouTube-Commons.","downloads":2491,"tags":["task_categories:text-generation","language:en","language:fr","language:es","language:pt","language:de","language:ru","license:cc-by-4.0","region:us","conversational"],"createdAt":"2024-04-15T22:12:27.000Z","key":""},{"_id":"664fd526d5bea69bcaf7e73f","id":"vikhyatk/openimages-bbox","author":"vikhyatk","disabled":false,"gated":false,"lastModified":"2024-07-10T00:29:23.000Z","likes":10,"trendingScore":2,"private":false,"sha":"9c9a38be0f64b5f6caf2fcac4a00e2b155f1b7ad","description":"Images and nounding box annotations from the OpenImages 
dataset.\n","downloads":30404,"tags":["size_categories:1M<n<10M","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-05-23T23:45:42.000Z","key":""},{"_id":"66561cbaa57d0c38363003a7","id":"mlabonne/harmless_alpaca","author":"mlabonne","disabled":false,"gated":false,"lastModified":"2024-05-30T09:03:22.000Z","likes":44,"trendingScore":2,"private":false,"sha":"02c6a92cfcf11bb0c387334f8146d149d65b587f","downloads":15746,"tags":["language:en","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-05-28T18:04:42.000Z","key":""},{"_id":"66708709d5c5d8fd8db3a5cf","id":"mlfoundations/dclm-baseline-1.0","author":"mlfoundations","disabled":false,"gated":false,"lastModified":"2024-07-22T15:27:52.000Z","likes":290,"trendingScore":2,"private":false,"sha":"a3b142c183aebe5af344955ae20836eb34dcf69b","description":"\n\t\n\t\t\n\t\tDCLM-baseline\n\t\n\nDCLM-baseline is a 4T token / 3B document pretraining dataset that achieves strong performance on language model benchmarks.\nBelow are comparisions of model trained on DCLM-baseline with other models in the 7B regime.\n\n\t\n\t\t\nModel\nParams\nTokens\nOpen dataset?\nCORE\nMMLU\nEXTENDED\n\n\n\t\t\nOpen weights, closed datasets\n\n\n\n\n\n\n\n\nLlama2\n7B\n2T\n✗\n49.2\n45.8\n34.1\n\n\nDeepSeek\n7B\n2T\n✗\n50.7\n48.5\n35.3\n\n\nMistral-0.3\n7B\n?\n✗\n57.0\n62.7\n45.1\n\n\nQWEN-2\n7B\n?\n✗\n57.5\n71.9\n50.5\n\n\nLlama3\n8B\n15T\n✗\n57.6… See the full description on the dataset page: 
https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0.","downloads":496486,"tags":["license:cc-by-4.0","arxiv:2406.11794","region:us"],"createdAt":"2024-06-17T18:57:13.000Z","key":""},{"_id":"6685442540d7dba6c178c3dc","id":"eth-nlped/mathdial","author":"eth-nlped","disabled":false,"gated":false,"lastModified":"2025-02-26T07:34:18.000Z","likes":16,"trendingScore":2,"private":false,"sha":"acc3878459e0bd8c04ab840056572f0b8b1abe1f","description":"\n\t\n\t\t\n\t\tMathdial dataset\n\t\n\nhttps://arxiv.org/abs/2305.14536\nMathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems.\nMathDial is grounded in math word problems as well as student confusions which provide a challenging testbed for creating faithful and equitable dialogue tutoring models able to reason over complex information. Current models achieve high accuracy in solving such problems but they fail in the task of teaching.\n\n\t\n\t\t\n\t\n\t\n\t\tData… See the full description on the dataset page: https://huggingface.co/datasets/eth-nlped/mathdial.","downloads":389,"tags":["task_categories:text-generation","language:en","license:cc-by-4.0","size_categories:1K<n<10K","format:json","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2305.14536","region:us","dialog","tutoring","math","gsm8k","conversation","education"],"createdAt":"2024-07-03T12:29:25.000Z","key":""},{"_id":"66952974b8a00bc24d6b112a","id":"HuggingFaceTB/smollm-corpus","author":"HuggingFaceTB","disabled":false,"gated":false,"lastModified":"2024-09-06T07:04:57.000Z","likes":469,"trendingScore":2,"private":false,"sha":"3ba9d605774198c5868892d7a8deda78031a781f","description":"\n\t\n\t\t\n\t\tSmolLM-Corpus\n\t\n\nThis dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. 
\nYou can find more details about the models trained on this dataset in our SmolLM blog post.\n\n\t\n\t\t\n\t\tDataset subsets\n\t\n\n\n\t\n\t\t\n\t\tCosmopedia v2\n\t\n\nCosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.","downloads":32809,"tags":["language:en","license:odc-by","size_categories:100M<n<1B","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-07-15T13:51:48.000Z","key":""},{"_id":"669d59d7dac1eb34c0bd1107","id":"fixie-ai/common_voice_17_0","author":"fixie-ai","disabled":false,"gated":false,"lastModified":"2025-01-17T02:41:14.000Z","likes":17,"trendingScore":2,"private":false,"sha":"34f78a43893414e7b6e271ba94c1d5e05f18b239","downloads":92337,"tags":["size_categories:10M<n<100M","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-07-21T18:56:23.000Z","key":""},{"_id":"66b430108103b780544582d7","id":"jasongzy/Mixamo","author":"jasongzy","disabled":false,"gated":"auto","lastModified":"2025-03-05T01:36:53.000Z","likes":19,"trendingScore":2,"private":false,"sha":"b1c7f4975ea3261d3d0aa2379f6e24754ccde9d8","description":"\n\t\n\t\t\n\t\tMake-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters\n\t\n\n\nPaper\nProject Page\n\n\n\t\n\t\t\n\t\tData\n\t\n\n\ncharacter: 95 characters (T-pose with bones and texture) downloaded from Mixamo\n\ncharacter_fbx_upgraded: 51 (among 95) characters with FBX version upgraded by FBX Converter (so that they can be imported into Blender)\n\ncharacter_refined: all 95 characters (triangle mesh without texture, animatable by any one from animation) processed with 
character_refine.py… See the full description on the dataset page: https://huggingface.co/datasets/jasongzy/Mixamo.","downloads":447,"tags":["modality:3d","arxiv:2411.18197","region:us","3d"],"createdAt":"2024-08-08T02:40:16.000Z","key":""},{"_id":"66dab5c4204cd0a4f82c05be","id":"TommyChien/UltraDomain","author":"TommyChien","disabled":false,"gated":false,"lastModified":"2024-09-09T02:48:23.000Z","likes":57,"trendingScore":2,"private":false,"sha":"aa8a51d523f8fc3c5a0ab90dd16b7f6b9dbb5d0d","description":"For the usage of this benchmark dataset, please refer to this repo.\n","downloads":853,"tags":["task_categories:question-answering","language:en","license:apache-2.0","region:us"],"createdAt":"2024-09-06T07:56:52.000Z","key":""},{"_id":"66db404182706c13053bba43","id":"Guilherme34/uncensor","author":"Guilherme34","disabled":false,"gated":false,"lastModified":"2024-09-06T17:47:57.000Z","likes":77,"trendingScore":2,"private":false,"sha":"a7c5a771f9f44fb04f35ec9f05abeccd32ecca83","downloads":80,"tags":["region:us"],"createdAt":"2024-09-06T17:47:45.000Z","key":""},{"_id":"66dee2af27a9f456b972384d","id":"jingyaogong/minimind_dataset","author":"jingyaogong","disabled":false,"gated":false,"lastModified":"2026-04-09T08:27:39.000Z","likes":108,"trendingScore":2,"private":false,"sha":"312afb4f76391145c6902f765bb51691c09a12f5","description":"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\t\n\t\t\n\t\n\t\n\t\t📌 数据介绍\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tⅠ Tokenizer\n\t\n\n分词器可以粗略理解成 LLM 使用的一本“词典”，负责把自然语言映射成 token id，再把 token id 解码回文本；项目中也提供了train_tokenizer.py作为词表训练示例。不建议重新训练 tokenizer，因为词表和切分规则一旦变化，模型权重、数据格式、推理接口与社区生态的兼容性都会下降，也会削弱模型的传播性。同时，tokenizer 还会影响 PPL 这类按 token 统计的指标，因此跨 tokenizer 比较时，BPB（Bits Per Byte）往往更有参考价值，可参考这篇。\n对 MiniMind 这类小模型来说，词表大小还会直接影响 embedding 层和输出层的参数占比，因此保持词表精简通常是更合适的取舍。\nTokenizer介绍\n\n第三方强大的开源模型例如 Yi、Qwen2、ChatGLM、Mistral、Llama 3 的 tokenizer 词表长度如下：\n\n  Tokenizer模型词表大小来源… See the full description on the dataset page: 
https://huggingface.co/datasets/jingyaogong/minimind_dataset.","downloads":2467,"tags":["task_categories:text-generation","language:multilingual","license:apache-2.0","license:cc-by-nc-2.0","region:us","chat","sft","instruction-tuning","reasoning","code","agent"],"createdAt":"2024-09-09T11:57:35.000Z","key":""},{"_id":"66ec310ff6a692d629b2667b","id":"wikimedia/structured-wikipedia","author":"wikimedia","disabled":false,"gated":false,"lastModified":"2026-05-19T12:54:16.000Z","likes":386,"trendingScore":2,"private":false,"sha":"417c267bb457fa645c22eb3b5c77764963194c70","description":"\n\t\n\t\t\n\t\n\t\n\t\tDataset Card for Wikimedia Structured Wikipedia\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tQuick Links\n\t\n\n\nWikimedia Enterprise\nStructured Contents Documentation\nData Dictionary\nWikimedia Attribution Framework\nMeta-Wiki Discussion\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Summary\n\t\n\nPre-parsed English and French Wikipedia articles, extracted using the Wikimedia Enterprise Snapshot API.\nThis dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and output as structured data with a… See the full description on the dataset page: https://huggingface.co/datasets/wikimedia/structured-wikipedia.","downloads":16481,"tags":["language:en","language:fr","license:cc-by-sa-4.0","size_categories:10M<n<100M","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us","wikipedia","wikimedia","structured-data","parquet","knowledge-base","references","citations","tables","multilingual"],"createdAt":"2024-09-19T14:11:27.000Z","key":""},{"_id":"66f830e08d215c6331bec22a","id":"nvidia/OpenMathInstruct-2","author":"nvidia","disabled":false,"gated":false,"lastModified":"2024-11-25T20:07:28.000Z","likes":247,"trendingScore":2,"private":false,"sha":"469216e3f46f4dacf476b382e192485ea51a143e","description":"\n\t\n\t\t\n\t\tOpenMathInstruct-2\n\t\n\nOpenMathInstruct-2 is a math instruction tuning 
dataset with 14M problem-solution pairs \ngenerated using the Llama3.1-405B-Instruct model.\nThe training set problems of GSM8K\nand MATH are used for constructing the dataset in the following ways: \n\nSolution augmentation: Generating chain-of-thought solutions for training set problems in GSM8K and MATH. \nProblem-Solution augmentation: Generating new problems, followed by solutions for these new problems.… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/OpenMathInstruct-2.","downloads":81247,"tags":["task_categories:question-answering","task_categories:text-generation","language:en","license:cc-by-4.0","size_categories:10M<n<100M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2410.01560","region:us","math","nvidia"],"createdAt":"2024-09-28T16:37:52.000Z","key":""},{"_id":"67118b6445f13cb79173176c","id":"facebook/OMAT24","author":"facebook","disabled":false,"gated":false,"lastModified":"2025-12-11T19:29:18.000Z","likes":73,"trendingScore":2,"private":false,"sha":"ca00b8aac581497cd0a13146fb3708ad8e2e2fe8","description":"Meta Open Materials 2024 (OMat24) Dataset\n\n\n  \n\n\n\t\n\t\t\n\t\tOverview\n\t\n\nSeveral datasets were utilized in this work. We provide open access to all datasets used to help accelerate research in the community.\nThis includes the OMat24 dataset as well as our modified sAlex dataset. Details on the different datasets are provided below.\nThe OMat24 datasets can be used with the FAIRChem package. 
See section on \"How to read the data\" below for a minimal example.\n\n\t\n\t\t\n\t\n\t\n\t\tDatasets\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tOMat24… See the full description on the dataset page: https://huggingface.co/datasets/facebook/OMAT24.","downloads":333,"tags":["license:cc-by-4.0","arxiv:2410.12771","doi:10.57967/hf/8434","region:us","chemistry","materials"],"createdAt":"2024-10-17T22:10:44.000Z","key":""},{"_id":"673f1aaaa7dc37c56d62bf13","id":"HuggingFaceTB/smol-smoltalk","author":"HuggingFaceTB","disabled":false,"gated":false,"lastModified":"2025-02-06T10:35:19.000Z","likes":104,"trendingScore":2,"private":false,"sha":"f73fe857d519ff6ac5af2ea67c4d3834da7b8bcc","description":"\n\t\n\t\t\n\t\tSmol-SmalTalk\n\t\n\nThis is a subset of SmolTalk dataset adapted for smol models with less than 1B parameters. We used it to build SmolLM2-360M-Instruct and \nSmolLM2-135M-Instruct. We do SFT on this dataset and then DPO on UltraFeedback.\nCompared to SmolTalk: \n\nThe conversations from Smol-Magpie-Ultra are shorter in this dataset\nWe include less task specific data compared to SmolTalk (e.g no function calling and less rewriting and summarization examples) since these smaller models have… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk.","downloads":9893,"tags":["language:en","license:apache-2.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2502.02737","region:us","synthetic"],"createdAt":"2024-11-21T11:34:02.000Z","key":""},{"_id":"67449661149efb6edaa63b98","id":"HuggingFaceTB/finemath","author":"HuggingFaceTB","disabled":false,"gated":false,"lastModified":"2025-02-06T10:31:11.000Z","likes":369,"trendingScore":2,"private":false,"sha":"e92b25a616738fe95dc186b64dfb19f9c8525594","description":"\n\t\n\t\t\n\t\t📐 FineMath\n\t\n\n\n\n\t\n\t\t\n\t\tWhat is it?\n\t\n\n📐 FineMath consists of 34B tokens (FineMath-3+) and 54B 
tokens (FineMath-3+ with InfiMM-WebMath-3+) of mathematical educational content filtered from CommonCrawl. To curate this dataset, we trained a mathematical content classifier using annotations generated by LLama-3.1-70B-Instruct. We used the classifier to retain only the most educational mathematics content, focusing on clear explanations and step-by-step problem solving rather than… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/finemath.","downloads":23007,"tags":["license:odc-by","size_categories:10M<n<100M","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2502.02737","doi:10.57967/hf/3847","region:us"],"createdAt":"2024-11-25T15:23:13.000Z","key":""},{"_id":"67482ec5e9d3466929bc50af","id":"defeatbeta/yahoo-finance-data","author":"defeatbeta","disabled":false,"gated":false,"lastModified":"2026-06-30T07:01:05.000Z","likes":100,"trendingScore":2,"private":false,"sha":"ec6572a5f16d61e78399d08b2c12c70378a7a346","description":"\n\t\n\t\t\n\t\n\t\n\t\tThe Financial data from Yahoo!\n\t\n\n\n\n\n\t\n\t\t\n\t\n\t\n\t\t*** Key Points to Note ***\n\t\n\n\nAll financial data is sourced from Yahoo!Ⓡ Finance, Nasdaq!Ⓡ, and the U.S. 
Department of the Treasury via publicly available APIs, and is intended for research and educational purposes.\nI will update the data regularly, and you are welcome to follow this project and use the data.\nEach time the data is updated, I will record the update time in spec.json.\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tData Usage Instructions\n\t\n\nUse DuckDB… See the full description on the dataset page: https://huggingface.co/datasets/defeatbeta/yahoo-finance-data.","downloads":136658,"tags":["language:en","license:odc-by","size_categories:100M<n<1B","region:us","earnings-call-transcripts","market-data","stock-data","finance-data","finance","stock-news","yahoo-news"],"createdAt":"2024-11-28T08:50:13.000Z","key":""},{"_id":"675d7e29e24babdf1842d270","id":"m-a-p/FineFineWeb","author":"m-a-p","disabled":false,"gated":false,"lastModified":"2024-12-19T11:34:03.000Z","likes":152,"trendingScore":2,"private":false,"sha":"7fd92dc825a75cbff271a5a52eea0eda91a2c112","description":"\n\t\n\t\t\n\t\tFineFineWeb: A Comprehensive Study on Fine-Grained Domain Web Corpus\n\t\n\narXiv: Coming Soon\nProject Page: Coming Soon\nBlog: Coming Soon\n\n\t\n\t\t\n\t\tData Statistics\n\t\n\n\n\t\n\t\t\nDomain (#tokens/#samples)\nIteration 1 Tokens\nIteration 2 Tokens\nIteration 3 Tokens\nTotal Tokens\nIteration 1 Count\nIteration 2 Count\nIteration 3 Count\nTotal Count\n\n\n\t\t\naerospace\n5.77B\n261.63M\n309.33M\n6.34B\n9100000\n688505\n611034\n10399539\n\n\nagronomy\n13.08B\n947.41M\n229.04M\n14.26B\n15752828\n2711790\n649404\n19114022\n\n\nartistic… See the full description on the dataset page: 
https://huggingface.co/datasets/m-a-p/FineFineWeb.","downloads":1243007,"tags":["task_categories:text-classification","task_categories:text-generation","language:en","license:apache-2.0","size_categories:1B<n<10B","modality:tabular","modality:text","region:us"],"createdAt":"2024-12-14T12:46:33.000Z","key":""},{"_id":"6763e94724dee5a47c7c77f7","id":"agibot-world/AgiBotWorld-Alpha","author":"agibot-world","disabled":false,"gated":"auto","lastModified":"2025-09-29T11:38:00.000Z","likes":225,"trendingScore":2,"private":false,"sha":"128665c9e0244c45d1cbe5c13f5a4706afd24f27","description":"\n\n  \n    \n  \n  \n    \n  \n  \n    \n  \n  \n    \n  \n\n\n\n\t\t\n\t\t⚠️Important Notice !!!\n\t\n\nDear Users,\nThe Alpha Dataset has been updated as follows:\n\nFrame Loss Data Removal: Several episodes with frame loss issues have been removed. For the complete list of removed episode IDs, please refer to this document.\nChanges in Episode Count: The updated Alpha Dataset retains the original 36 tasks. The new version has been enriched with additional interactive objects, extending the total duration from 474.12… See the full description on the dataset page: https://huggingface.co/datasets/agibot-world/AgiBotWorld-Alpha.","downloads":9326,"tags":["task_categories:robotics","task_categories:other","language:en","size_categories:10M<n<100M","format:webdataset","modality:text","library:datasets","library:webdataset","library:mlcroissant","region:us","real-world","dual-arm","Robotics manipulation"],"createdAt":"2024-12-19T09:37:11.000Z","key":""},{"_id":"676f70846bf205795346d2be","id":"FreedomIntelligence/medical-o1-reasoning-SFT","author":"FreedomIntelligence","disabled":false,"gated":false,"lastModified":"2025-04-22T15:11:21.000Z","likes":1130,"trendingScore":2,"private":false,"sha":"fc2c9e8a37b38f38da6d449564a8c350b244aef4","description":"\n\t\n\t\t\n\t\tNews\n\t\n\n[2025/04/22] We split the data and kept only the medical SFT dataset (medical_o1_sft.json). 
The file medical_o1_sft_mix.json contains a mix of medical and general instruction data.\n[2025/02/22] We released the distilled dataset from Deepseek-R1 based on medical verifiable problems. You can use it to initialize your models with the reasoning chain from Deepseek-R1.\n[2024/12/25] We open-sourced the medical reasoning dataset for SFT, built on medical verifiable problems and an LLM… See the full description on the dataset page: https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT.","downloads":9912,"tags":["task_categories:question-answering","task_categories:text-generation","language:en","language:zh","license:apache-2.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2412.18925","region:us","medical","biology"],"createdAt":"2024-12-28T03:29:08.000Z","key":""},{"_id":"6785bb86c3b883cd1ad320ec","id":"awsaf49/epic_kitchens_100","author":"awsaf49","disabled":false,"gated":false,"lastModified":"2025-01-14T03:23:01.000Z","likes":25,"trendingScore":2,"private":false,"sha":"c5e573b9295c06bd6d14bd5a9b9c7b178f00d7b1","description":"\n\t\n\t\t\n\t\tMotivation\n\t\n\nThe actual download link is very slow, including the academic torrent. 
Therefore, to spare fellow community members from this misery, I am uploading the dataset here.\n\n\t\n\t\t\n\t\tSource\n\t\n\nYou can find the original source to download the dataset: https://github.com/epic-kitchens/epic-kitchens-download-scripts\n\n\t\n\t\t\n\t\tCitation\n\t\n\n@INPROCEEDINGS{Damen2018EPICKITCHENS,\n   title={Scaling Egocentric Vision: The EPIC-KITCHENS Dataset},\n   author={Damen, Dima and Doughty, Hazel and… See the full description on the dataset page: https://huggingface.co/datasets/awsaf49/epic_kitchens_100.","downloads":9056,"tags":["task_categories:voice-activity-detection","task_categories:video-classification","license:cc-by-nc-4.0","size_categories:n<1K","modality:video","modality:audio","region:us","epic_kitchen","video","audio","epic_kitchens_100","dataset"],"createdAt":"2025-01-14T01:19:02.000Z","key":""},{"_id":"6795a5ea495916be7c8a6017","id":"newfacade/LeetCodeDataset","author":"newfacade","disabled":false,"gated":false,"lastModified":"2025-05-29T09:00:24.000Z","likes":74,"trendingScore":2,"private":false,"sha":"215604aeed660029df7de2fea5a4d7b6ed476a08","description":"\n\t\n\t\t\n\t\tLeetCodeDataset\n\t\n\nLeetCodeDataset is a dataset consisting of Python LeetCode problems that can be used for LLM training and evaluation.\n\n    💻 GitHub \n    📄 LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs  \n    📄 Policy Filtration for RLHF to Mitigate Noise in Reward Models 
\n\n","downloads":2121,"tags":["task_categories:text-generation","language:en","license:apache-2.0","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2504.14655","arxiv:2409.06957","region:us","code"],"createdAt":"2025-01-26T03:03:06.000Z","key":""},{"_id":"6799c7f5754836e22dc052ec","id":"llm-jp/AnswerCarefully","author":"llm-jp","disabled":false,"gated":"auto","lastModified":"2026-02-16T03:52:54.000Z","likes":56,"trendingScore":2,"private":false,"sha":"71582a7bbf2096a3d7d2de33b215570cced1be9f","description":"\n\t\n\t\t\n\t\tAnswerCarefully\n\t\n\n概要\nAnswerCarefullyは日本語LLM 出力の安全性・適切性に特化したインストラクションデータセットです。\nこのデータセットは、英語の要注意回答を集めた Do-Not-Answer データセット の包括的なカテゴリ分類に基づき、人手で質問・回答ともに日本語サンプルを集めたオリジナルのデータセットです。\nデータセットの詳細については、こちらをご覧ください。\nOverview\nAnswerCarefully is an instruction dataset specifically aimed at ensuring safety and appropriateness of LLM output in Japanese.\nThis dataset consists of original pairs of questions and reference (safe) responses based on the extensive safety taxonomy proposed in Do-Not-Answer… See the full description on the dataset page: https://huggingface.co/datasets/llm-jp/AnswerCarefully.","downloads":854,"tags":["language:ja","language:en","license:other","size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2506.02372","region:us"],"createdAt":"2025-01-29T06:17:25.000Z","key":""},{"_id":"67a9f247188f29a956a34a04","id":"AI-MO/NuminaMath-1.5","author":"AI-MO","disabled":false,"gated":false,"lastModified":"2026-01-29T11:00:49.000Z","likes":190,"trendingScore":2,"private":false,"sha":"1b05109f9e5c1ad06c0663519502416c30b300f8","description":"\n\t\n\t\t\n\t\tDataset Card for NuminaMath 1.5\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThis is the second iteration of the popular NuminaMath dataset, bringing high quality post-training data for approximately 900k 
competition-level math problems.  Each solution is formatted in a Chain of Thought (CoT) manner. The sources of the dataset range from Chinese high school math exercises to US and international mathematics olympiad competition problems. The data were primarily collected from online exam paper PDFs… See the full description on the dataset page: https://huggingface.co/datasets/AI-MO/NuminaMath-1.5.","downloads":5404,"tags":["task_categories:text-generation","language:en","license:apache-2.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","math","post-training"],"createdAt":"2025-02-10T12:34:15.000Z","key":""},{"_id":"67aa021ced8d8663d42505cc","id":"open-r1/OpenR1-Math-220k","author":"open-r1","disabled":false,"gated":false,"lastModified":"2025-02-18T11:45:27.000Z","likes":762,"trendingScore":2,"private":false,"sha":"e4e141ec9dea9f8326f4d347be56105859b2bd68","description":"\n\t\n\t\t\n\t\tOpenR1-Math-220k\n\t\n\n\n\t\n\t\t\n\t\tDataset description\n\t\n\nOpenR1-Math-220k is a large-scale dataset for mathematical reasoning. It consists of 220k math problems with two to four reasoning traces generated by DeepSeek R1 for problems from NuminaMath 1.5. 
\nThe traces were verified using Math Verify for most samples and Llama-3.3-70B-Instruct as a judge for 12% of the samples, and each problem contains at least one reasoning trace with a correct answer.\nThe dataset consists of two splits:… See the full description on the dataset page: https://huggingface.co/datasets/open-r1/OpenR1-Math-220k.","downloads":39222,"tags":["language:en","license:apache-2.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us"],"createdAt":"2025-02-10T13:41:48.000Z","key":""},{"_id":"67ab2149bc76a9aab3ff90ed","id":"agibot-world/AgiBotWorld-Beta","author":"agibot-world","disabled":false,"gated":"auto","lastModified":"2025-10-13T10:01:35.000Z","likes":67,"trendingScore":2,"private":false,"sha":"2252b3fdc88dc67fa5d5e4b771e1b86f5e71c278","description":"\n\n  \n    \n  \n  \n    \n  \n  \n    \n  \n  \n    \n  \n\n\n\n\n\t\t\n\t\tKey Features 🔑\n\t\n\n\n1 million+ trajectories from 100 robots, with a total duration of 2976.4 hours.\n100+ real-world scenarios across 5 target domains.\nCutting-edge hardware: visual tactile sensors / 6-DoF dexterous hand / mobile dual-arm robots\n200+ types of tasks:\nContact-rich manipulation\nLong-horizon planning\nMulti-robot collaboration\n\n\n87 types of Atomic Skills, including Tie, OpenJar, Peel, Sweep etc.\n\n\n    \n        \n        Your… See the full description on the dataset page: https://huggingface.co/datasets/agibot-world/AgiBotWorld-Beta.","downloads":33126,"tags":["task_categories:other","task_categories:robotics","language:en","size_categories:100M<n<1B","format:webdataset","modality:text","library:datasets","library:webdataset","library:mlcroissant","region:us","real-world","dual-arm","Robotics 
manipulation"],"createdAt":"2025-02-11T10:07:05.000Z","key":""},{"_id":"67b78333f663232795e6cb29","id":"SynthLabsAI/Big-Math-RL-Verified","author":"SynthLabsAI","disabled":false,"gated":"auto","lastModified":"2025-03-25T15:33:48.000Z","likes":235,"trendingScore":2,"private":false,"sha":"c75d2f117cddfecb6bd08756e61e508e59732b21","description":"\n\t\n\t\t\n\t\tBig-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models\n\t\n\nBig-Math is the largest open-source dataset of high-quality mathematical problems, curated specifically for reinforcement learning (RL) training in language models. With over 250,000 rigorously filtered and verified problems, Big-Math bridges the gap between quality and quantity, establishing a robust foundation for advancing reasoning in LLMs.\n\n  \n    Request Early Access to Private… See the full description on the dataset page: https://huggingface.co/datasets/SynthLabsAI/Big-Math-RL-Verified.","downloads":5120,"tags":["task_categories:question-answering","task_categories:text-generation","language:en","license:apache-2.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2502.17387","region:us","mathematics","math","reinforcement-learning","RL","reasoning","verifiable","open-ended-questions","closed-form-answers"],"createdAt":"2025-02-20T19:32:03.000Z","key":""},{"_id":"67b8973e95da0a82c8b74294","id":"datasetmaster/resumes","author":"datasetmaster","disabled":false,"gated":false,"lastModified":"2025-02-21T16:06:40.000Z","likes":17,"trendingScore":2,"private":false,"sha":"3a53af8a0ac0eb7a6a0d83aca29d6e3fc103c446","description":"\n\t\n\t\t\n\t\tDataset Card for Advanced Resume Parser & Job Matcher Resumes\n\t\n\nThis dataset contains a merged collection of real and synthetic resume data in JSON format. 
The resumes have been normalized to a common schema to facilitate the development of NLP models for candidate-job matching in the technical recruitment domain.\n\n\t\n\t\t\n\t\tDataset Details\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\nThis dataset is a combined collection of real resumes and synthetically generated CVs. \n\nCurated by: datasetmaster… See the full description on the dataset page: https://huggingface.co/datasets/datasetmaster/resumes.","downloads":1843,"tags":["task_categories:token-classification","language:en","license:mit","size_categories:1K<n<10K","region:us","resumes","NLP","synthetic","real-world","recruitment","job matching"],"createdAt":"2025-02-21T15:09:50.000Z","key":""},{"_id":"67c37444fa7bfa443b9d8ad8","id":"Youseff1987/multilingual_translation_sft","author":"Youseff1987","disabled":false,"gated":false,"lastModified":"2025-03-01T20:58:01.000Z","likes":4,"trendingScore":2,"private":false,"sha":"e612c4dd2ce47dd848c7e7b42f560b77554e91e4","downloads":48,"tags":["task_categories:translation","language:ko","language:en","language:zh","language:zu","language:ja","language:am","language:ar","language:es","language:fr","language:ru","language:de","language:it","language:pt","language:nl","language:sv","language:tr","language:id","language:vi","language:pl","language:cs","language:ro","language:uk","language:hu","language:sl","language:el","language:fi","language:no","language:da","language:bg","language:hi","language:he","language:ms","language:ta","language:te","language:pa","language:bn","language:fa","language:sw","language:th","language:sr","language:hr","language:ca","language:is","language:lv","language:lt","language:sk","language:et","language:mn","language:la","language:my","language:tl","language:jv","language:mr","language:gu","language:ps","language:sd","language:kn","language:ml","language:ha","language:yo","language:ig","language:ber","license:mit","size_categories:1M<n<10M","format:parquet","modality:tabular","modality:text",
"library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2025-03-01T20:55:32.000Z","key":""},{"_id":"67c8647d4bcbc048532aec29","id":"ai4bharat/Rasa","author":"ai4bharat","disabled":false,"gated":"auto","lastModified":"2026-06-06T14:50:25.000Z","likes":40,"trendingScore":2,"private":false,"sha":"632f55c7ac590219d41cd7adffce5b440e4604f5","description":"\n\t\n\t\t\n\t\n\t\n\t\tRasa: Towards Building an Expressive Multilingual Text-To-Speech Dataset for Indian Languages\n\t\n\nFunded by: Bhashini, Ministry of Electronics and Information Technology, Government of IndiaSupported by: EkStep Foundation and Nilekani Philanthropies  \n\n\t\n\t\t\n\t\n\t\n\t\tOverview\n\t\n\nWe introduce Rasa, the first high-quality multilingual expressive Text-to-Speech (TTS) dataset for any Indian language. It comprises a minimum of 20 hours per speaker with a target of covering \na female and male… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/Rasa.","downloads":3904,"tags":["task_categories:text-to-speech","language:as","language:bn","language:kn","language:ml","language:mr","language:ne","language:ta","language:pa","language:te","language:sa","language:ur","language:ks","language:sd","license:cc-by-4.0","size_categories:1M<n<10M","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us"],"createdAt":"2025-03-05T14:49:33.000Z","key":""},{"_id":"67d258dd99a3ae46e6826e1a","id":"yifengzhu-hf/LIBERO-datasets","author":"yifengzhu-hf","disabled":false,"gated":false,"lastModified":"2025-05-18T17:02:51.000Z","likes":60,"trendingScore":2,"private":false,"sha":"f13aa24a3da8c43c7225569f28c562979fa0e35a","description":"\n\t\n\t\t\n\t\tLIBERO Datasets\n\t\n\nThis is a repo that stores the LIBERO datasets. 
The structure of the dataset can be found below:\nlibero_object/\nlibero_spatial/\nlibero_goal/\nlibero_90/\nlibero_10/\n\nDemonstrations of each task is stored in a hdf5 file. Please refer to download script from the official LIBERO repo for more details. \n","downloads":24153,"tags":["license:apache-2.0","region:us"],"createdAt":"2025-03-13T04:02:37.000Z","key":""},{"_id":"67d97c4be2b27852325fd8e2","id":"nvidia/PhysicalAI-Robotics-GR00T-X-Embodiment-Sim","author":"nvidia","disabled":false,"gated":false,"lastModified":"2026-03-05T23:36:40.000Z","likes":236,"trendingScore":2,"private":false,"sha":"ea7ac0b68f87da62f1e726771bba0fe74300802f","description":"\n\t\n\t\t\n\t\tPhysicalAI-Robotics-GR00T-X-Embodiment-Sim\n\t\n\n\nGithub Repo: Isaac GR00T N1\nWe provide a set of datasets used for post-training of GR00T N1. Each dataset is a collection of trajectories from different robot embodiments and tasks.\n\n\t\n\t\t\n\t\n\t\n\t\tCross-embodied bimanual manipulation: 9k trajectories\n\t\n\n\n\t\n\t\t\nDataset Name\n#trajectories\n\n\n\t\t\nbimanual_panda_gripper.Threading\n1000\n\n\nbimanual_panda_hand.LiftTray\n1000\n\n\nbimanual_panda_gripper.ThreePieceAssembly\n1000… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/PhysicalAI-Robotics-GR00T-X-Embodiment-Sim.","downloads":677161,"tags":["task_categories:robotics","license:cc-by-4.0","region:us","robotics"],"createdAt":"2025-03-18T13:59:39.000Z","key":""},{"_id":"67ecf302a089ff054aa518b1","id":"open-r1/Big-Math-RL-Verified-Processed","author":"open-r1","disabled":false,"gated":false,"lastModified":"2025-04-11T18:30:23.000Z","likes":29,"trendingScore":2,"private":false,"sha":"c79efbb6d3b75e3a2bcc27a5c569119918132345","description":"\n\t\n\t\t\n\t\tDataset Card for Big-Math-RL-Verified-Processed\n\t\n\nThis is a processed version of SynthLabsAI/Big-Math-RL-Verified where we have applied the following filters:\n\nRemoved samples where llama8b_solve_rate is None\nRemoved samples that 
could not be parsed by math-verify (empty lists)\n\nWe have also created 5 additional subsets to indicate difficulty level, similar to the MATH dataset. To do so, we computed quintiles on the llama8b_solve_rate values and then filtered the dataset into the… See the full description on the dataset page: https://huggingface.co/datasets/open-r1/Big-Math-RL-Verified-Processed.","downloads":414,"tags":["size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2502.17387","region:us"],"createdAt":"2025-04-02T08:19:14.000Z","key":""},{"_id":"6805ce84dbba4c5505ce0407","id":"Slait/russia_voices","author":"Slait","disabled":false,"gated":false,"lastModified":"2025-04-21T05:14:13.000Z","likes":11,"trendingScore":2,"private":false,"sha":"215d799f529bfe8815063064246b5c09e4b380d6","description":"Russian voices for train AI.\n♀ Male and ♂ Female\nMale voices - 497 pcs.\nFemale voices - 244 pcs.\n\nPrepared for training fish-speech\n\nThe author is not responsible for the votes.\nUse at your own risk.\n\n\n\n\t\n\t\t\n\t\tlicense: apache-2.0\ntask_categories:\n- zero-shot-classification\nlanguage:\n- ru\nsize_categories:\n- 1B<n<10B\n\t\n\n","downloads":3344,"tags":["modality:audio","modality:text","region:us"],"createdAt":"2025-04-21T04:50:12.000Z","key":""},{"_id":"680d4f7c49d5d8a9ad685eb8","id":"vectara/open_ragbench","author":"vectara","disabled":false,"gated":false,"lastModified":"2025-06-12T20:49:04.000Z","likes":19,"trendingScore":2,"private":false,"sha":"63f6b052ff83508b08e242db42263ee708815c26","description":"\n\t\n\t\t\n\t\tOpen RAG Benchmark\n\t\n\nThe Open RAG Benchmark is a unique, high-quality Retrieval-Augmented Generation (RAG) dataset constructed directly from arXiv PDF documents, specifically designed for evaluating RAG systems with a focus on multimodal PDF understanding. 
Unlike other datasets, Open RAG Benchmark emphasizes pure PDF content, meticulously extracting and generating queries on diverse modalities including text, tables, and images, even when they are intricately interwoven within a… See the full description on the dataset page: https://huggingface.co/datasets/vectara/open_ragbench.","downloads":886,"tags":["license:cc-by-nc-4.0","region:us"],"createdAt":"2025-04-26T21:26:20.000Z","key":""},{"_id":"681139b8ff0764f384f0b38e","id":"SWE-bench/SWE-bench_Verified","author":"SWE-bench","disabled":false,"gated":false,"lastModified":"2026-02-27T20:36:38.000Z","likes":98,"trendingScore":2,"private":false,"sha":"91aa3ed51b709be6457e12d00300a6a596d4c6a3","description":"Dataset Summary\nSWE-bench Verified is a subset of 500 samples from the SWE-bench test set, which have been human-validated for quality. SWE-bench is a dataset that tests systems’ ability to solve GitHub issues automatically. See this post for more details on the human-validation process.\nThe dataset collects 500 test Issue-Pull Request pairs from popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution.\nThe original… See the full description on the dataset page: https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified.","downloads":64913,"tags":["benchmark:official","benchmark:eval-yaml","size_categories:n<1K","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2025-04-29T20:42:32.000Z","key":""},{"_id":"681193b896da70073b6f4075","id":"YLab-Open/BRIDGE-Open","author":"YLab-Open","disabled":false,"gated":"auto","lastModified":"2025-08-16T23:00:44.000Z","likes":12,"trendingScore":2,"private":false,"sha":"41159677fba2454e032574625e24b24d29a17684","description":"\n  \n  \n  \n  \n\n\n📜 BRIDGE-Open Dataset\nThe BRIDGE-Open dataset is an open-access subset of the datasets used by the BRIDGE benchmark. 
This dataset includes the 55 tasks of the open-access datasets that can be released to the public. Each dataset should be used in accordance with the license of its original release (the source of each dataset is listed at our BRIDGE paper, Supplementary Section 5.) \n\nDue to privacy and security considerations of clinical data, regulated-access datasets can not… See the full description on the dataset page: https://huggingface.co/datasets/YLab-Open/BRIDGE-Open.","downloads":36,"tags":["arxiv:2504.19467","doi:10.57967/hf/8981","region:us"],"createdAt":"2025-04-30T03:06:32.000Z","key":""},{"_id":"6813714c21c604f71138e81d","id":"VedantPadwal/quantitative-finance-reasoning","author":"VedantPadwal","disabled":false,"gated":false,"lastModified":"2025-05-01T14:37:01.000Z","likes":7,"trendingScore":2,"private":false,"sha":"0cf079a2a527c003dba4b4d4215d7b935eebe362","description":"\n\t\n\t\t\n\t\tDataset Description\n\t\n\nDataset Summary\nThis dataset contains question-answer pairs focused on quantitative finance, covering topics such as option pricing, stochastic calculus (Brownian motion, Itô's Lemma), probability theory, and financial modeling assumptions. 
Each instance includes a question, a detailed ground-truth solution (often resembling textbook explanations or interview answers), a multi-step reasoning trace generated by a Gemini Pro model, and a structured validation of… See the full description on the dataset page: https://huggingface.co/datasets/VedantPadwal/quantitative-finance-reasoning.","downloads":83,"tags":["size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","reasoning-datasets-competition"],"createdAt":"2025-05-01T13:04:12.000Z","key":""},{"_id":"6820fb77b82e61bb50999662","id":"open-r1/Mixture-of-Thoughts","author":"open-r1","disabled":false,"gated":false,"lastModified":"2025-05-26T15:25:56.000Z","likes":318,"trendingScore":2,"private":false,"sha":"e55fa28006c0d0ec60fb3547520f775dd42d02cd","description":"\n\n\n\t\n\t\t\n\t\tDataset summary\n\t\n\nMixture-of-Thoughts is a curated dataset of 350k verified reasoning traces distilled from DeepSeek-R1. The dataset spans tasks in mathematics, coding, and science, and is designed to teach language models to reason step-by-step. 
It was used in the Open R1 project to train OpenR1-Distill-7B, an SFT model that replicates the reasoning capabilities of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B from the same base model.\nTo load the dataset, run:\nfrom datasets import… See the full description on the dataset page: https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts.","downloads":6433,"tags":["task_categories:text-generation","language:en","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2504.21318","arxiv:2505.00949","region:us"],"createdAt":"2025-05-11T19:33:11.000Z","key":""},{"_id":"68217a1ed82591f9946f62b9","id":"BLIP3o/BLIP3o-60k","author":"BLIP3o","disabled":false,"gated":false,"lastModified":"2025-05-25T18:15:37.000Z","likes":38,"trendingScore":2,"private":false,"sha":"f7316b0aa446338ee1707484924aa59457b4bbf3","description":"This is the BLIP3o-60k Text-to-Image instruction tuning dataset distilled from GPT-4o, including the following categories:\n\nJourneyDB\nHuman (including MSCOCO with human caption, human gestures, occupations)\nDalle3\nGeneval (no overlap with test set)\nCommon objects\nSimple text\n\nHere we provide code to download the tar file:\nfrom huggingface_hub import snapshot_download\nsnapshot_download(repo_id='BLIP3o/BLIP3o-60k', repo_type='dataset')\n\nAnd you can use huggingface datasets to read the tar… See the full description on the dataset page: 
https://huggingface.co/datasets/BLIP3o/BLIP3o-60k.","downloads":1272,"tags":["language:en","license:apache-2.0","size_categories:1K<n<10K","format:text","modality:text","library:datasets","library:mlcroissant","region:us"],"createdAt":"2025-05-12T04:33:34.000Z","key":""},{"_id":"682251c05bdc4bcb49a3c183","id":"google/svq","author":"google","disabled":false,"gated":false,"lastModified":"2026-06-25T16:52:08.000Z","likes":51,"trendingScore":2,"private":false,"sha":"4e1b9a700752aca45768493638968084b925adb0","description":"\n\t\n\t\t\n\t\n\t\n\t\tSimple Voice Questions\n\t\n\nSimple Voice Questions (SVQ) is a set of short audio questions recorded in 26 locales across 17 languages under multiple audio conditions. It serves as a core evaluation component for the Massive Sound Embedding Benchmark (MSEB).\n\n\t\n\t\t\n\t\n\t\n\t\tTechnical Specifications\n\t\n\n\n\t\n\t\t\nFeature\nDetails\n\n\n\t\t\nLocales\n26\n\n\nLanguages\n17\n\n\nTotal Speakers\n~700 (Capped at 250 recordings per speaker)\n\n\nAudio Conditions\nClean, Background Speech, Media, Traffic Noise\n\n\nGender… See the full description on the dataset page: 
https://huggingface.co/datasets/google/svq.","downloads":1816,"tags":["task_categories:question-answering","task_categories:automatic-speech-recognition","language:ar","language:bn","language:en","language:fi","language:gu","language:hi","language:id","language:ja","language:kn","language:ko","language:ml","language:mr","language:ru","language:sw","language:ta","language:te","language:ur","license:cc-by-4.0","size_categories:100K<n<1M","format:parquet","modality:audio","modality:tabular","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us"],"createdAt":"2025-05-12T19:53:36.000Z","key":""},{"_id":"6825a822ee98cb4abe490553","id":"fingertap/GPQA-Diamond","author":"fingertap","disabled":false,"gated":false,"lastModified":"2025-05-15T08:39:02.000Z","likes":10,"trendingScore":2,"private":false,"sha":"68be7564497676e07a77a042fdb587deb88c51c3","downloads":6312,"tags":["size_categories:n<1K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2025-05-15T08:38:58.000Z","key":""},{"_id":"6826e915875b575a7accbf93","id":"SingleBicycle/4KLSDB","author":"SingleBicycle","disabled":false,"gated":false,"lastModified":"2026-05-26T05:18:27.000Z","likes":10,"trendingScore":2,"private":false,"sha":"2ab06f3da913268378cb348c88b8b361f60f114f","description":"\n\t\n\t\t\n\t\t4KLSDB: A Large-Scale Dataset for 4K Image Restoration and Generation\n\t\n\nDataCV @ CVPR 2026 · Accepted 🎉\n\n\n  \n  \n  \n  \n  \n\n\n4KLSDB is a native-4K image dataset with 129,484 train / 2,000 val / 1,984 test images, spanning nature, urban scenes, people, food, artwork, CGI, animals, and architecture. 
It supports both image restoration (super-resolution) and 4K text-to-image generation.\nQuick links · 🌐 Project page · 💻 Code (GitHub) · 📄 Paper (arXiv) · 🤗 Dataset · 🧱 Checkpoints… See the full description on the dataset page: https://huggingface.co/datasets/SingleBicycle/4KLSDB.","downloads":10293,"tags":["task_categories:image-to-image","task_categories:text-to-image","language:en","license:cc-by-4.0","size_categories:100K<n<1M","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2605.24762","region:us","4K","super-resolution","text-to-image","diffusion","restoration","native-4K"],"createdAt":"2025-05-16T07:28:21.000Z","key":""},{"_id":"6836396519543f12e84cc686","id":"imageomics/TreeOfLife-200M","author":"imageomics","disabled":false,"gated":false,"lastModified":"2026-05-29T20:56:36.000Z","likes":35,"trendingScore":2,"private":false,"sha":"5f2dc493b3dc0e544438a04038ab15faa646b749","description":"\n\t\n\t\t\n\t\n\t\n\t\tDataset Card for TreeOfLife-200M\n\t\n\nIf you are looking for the original release TreeOfLife-200M dataset, as used in training BioCLIP 2 and presented the paper, please see Revision a8f38b4. 
The dataset, as presented here, was used to train BioCLIP 2.5 Huge; it completes the dataset cleaning process and resolves an issue where Observation.org occurrences were not included in the training data.\nWith 233 million images representing 933,798 taxa across the tree of life, TreeOfLife-200M… See the full description on the dataset page: https://huggingface.co/datasets/imageomics/TreeOfLife-200M.","downloads":9278,"tags":["task_categories:image-classification","task_categories:zero-shot-classification","language:en","language:la","license:cc0-1.0","size_categories:100M<n<1B","modality:image","modality:text","doi:10.57967/hf/8980","region:us","biology","image","imageomics","animals","plants","fungi","evolutionary biology","CV","multimodal","clip","species","taxonomy","knowledge-guided","imbalanced"],"createdAt":"2025-05-27T22:15:01.000Z","key":""},{"_id":"6837854ff36dbe5068b5d602","id":"open-thoughts/OpenThoughts3-1.2M","author":"open-thoughts","disabled":false,"gated":false,"lastModified":"2025-06-09T16:14:06.000Z","likes":244,"trendingScore":2,"private":false,"sha":"61bcf9d4eb38b30295efc2021227a63cc5bb34c8","description":"\n    \n\n\n\npaper |\ndataset |\nmodel\n\n\n\n[!NOTE]\nWe have released a paper for OpenThoughts! See our paper here.\n\n\n \n\n\n\n\n\t\n\t\n\t\n\t\tOpenThoughts3-1.2M\n\t\n\nOpen-source state-of-the-art reasoning dataset with 1.2M rows. 
🚀\nOpenThoughts3-1.2M is the third iteration in our line of OpenThoughts datasets, building on our previous OpenThoughts-114k and OpenThoughts2-1M.\nThis time around, we scale even further and generate our dataset in a much more systematic way -- OpenThoughts3-1.2M is the result of a… See the full description on the dataset page: https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M.","downloads":18522,"tags":["task_categories:text-generation","license:apache-2.0","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2506.04178","region:us","reasoning","mathematics","code","science"],"createdAt":"2025-05-28T21:51:11.000Z","key":""},{"_id":"6848202ce6d8618ef50ca37e","id":"nvidia/PhysicalAI-Autonomous-Vehicle-Cosmos-Drive-Dreams","author":"nvidia","disabled":false,"gated":false,"lastModified":"2025-06-15T12:40:08.000Z","likes":58,"trendingScore":2,"private":false,"sha":"b405b5dea2a4e2e014d7c77d5ea5a28e77febf56","description":"\n\t\n\t\t\n\t\tPhysicalAI-Autonomous-Vehicle-Cosmos-Drive-Dreams\n\t\n\nPaper | Paper Website | GitHub\n\n\n\t\n\t\t\n\t\n\t\n\t\tDownload\n\t\n\nWe provide a download script to download our dataset. 
If you have enough space, you can use git to download a dataset from huggingface.\nusage: download.py [-h] --odir ODIR\n                                       [--file_types {hdmap,lidar,synthetic}[,…]]\n                                       [--workers N] [--clean_cache]\n\nrequired arguments:\n  --odir ODIR            Output… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicle-Cosmos-Drive-Dreams.","downloads":11481,"tags":["task_categories:robotics","language:en","license:cc-by-4.0","size_categories:n>1T","arxiv:2506.09042","region:us","Video","physicalAI","AV"],"createdAt":"2025-06-10T12:08:12.000Z","key":""},{"_id":"68533feba891f471f3f9c808","id":"nvidia/cvdp-benchmark-dataset","author":"nvidia","disabled":false,"gated":false,"lastModified":"2026-06-08T19:31:34.000Z","likes":29,"trendingScore":2,"private":false,"sha":"39fa3f3971482fa54dc3cb86e23168fdb87a6650","description":"Important please see \"Files and versions\" above for full list of files in the CVDP dataset.\nPlease see LICENSE and NOTICE for licensing information. See CHANGELOG for changes.\nThis is the Comprehensive Verilog Design Problems (CVDP) benchmark dataset to use with the CVDP infrastructure on GitHub.\n","downloads":1403,"tags":["license:other","size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2025-06-18T22:38:35.000Z","key":""},{"_id":"685584aa0077eddadd99b004","id":"neulab/agent-data-collection","author":"neulab","disabled":false,"gated":false,"lastModified":"2026-03-09T18:54:05.000Z","likes":115,"trendingScore":2,"private":false,"sha":"31a76bfb0124d77ae7322eabbb0171bf11ee2c67","description":"\n\t\n\t\t\n\t\tAgent Data Collection\n\t\n\nA comprehensive collection of agent interaction datasets for training and evaluating AI agents across diverse domains and tasks. 
\nThis dataset aggregates high-quality agent trajectories from various environments including web browsing, code generation, household tasks, knowledge base querying, and software engineering.\nThe dataset is collected through methods described in Agent Data Protocol.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Splits\n\t\n\nEach dataset configuration provides up… See the full description on the dataset page: https://huggingface.co/datasets/neulab/agent-data-collection.","downloads":3235,"tags":["task_categories:text-generation","language:en","size_categories:1M<n<10M","arxiv:2510.24702","region:us","agent","multi-turn","tool-use","reasoning","interactive"],"createdAt":"2025-06-20T15:56:26.000Z","key":""},{"_id":"68661a935dbd7fc7b95b95a9","id":"MikePfunk28/resume-training-dataset","author":"MikePfunk28","disabled":false,"gated":"auto","lastModified":"2025-07-06T19:16:12.000Z","likes":6,"trendingScore":2,"private":false,"sha":"bd167fcb1dc281707c8ca7a2d4aa822f63650d74","description":"\n\t\n\t\t\n\t\tResume Training Dataset\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThis dataset contains 22,855 curated resume samples designed for training AI models on resume analysis, generation, and career development tasks. 
Each entry includes structured conversations between users seeking resume help and AI assistants providing feedback, making it ideal for training models to understand professional writing patterns, critique resumes, and suggest improvements.\n\n\t\n\t\t\n\t\tDataset Details\n\t\n\n\n\t\n\t\t\n\t\tSupported… See the full description on the dataset page: https://huggingface.co/datasets/MikePfunk28/resume-training-dataset.","downloads":31,"tags":["task_categories:feature-extraction","language:en","language:vi","language:zh","license:mit","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","resume","resumes","images","text","summaries"],"createdAt":"2025-07-03T05:52:19.000Z","key":""},{"_id":"68669d92a6fbdd36f06c720f","id":"MathLLMs/ImgCode-8.6M","author":"MathLLMs","disabled":false,"gated":false,"lastModified":"2025-10-11T06:03:09.000Z","likes":24,"trendingScore":2,"private":false,"sha":"e956fcc2c51c8dd6bde1485ceea670f183141bd7","description":"\n\t\n\t\t\n\t\tMathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning\n\t\n\nRepo: https://github.com/mathllm/MathCoder\nPaper: https://huggingface.co/papers/2505.10557\n\n\t\n\t\t\n\t\tIntroduction\n\t\n\nWe introduce MathCoder-VL, a series of open-source large multimodal models (LMMs) specifically tailored for general math problem-solving. 
We also introduce FigCodifier-8B, an image-to-code model.\n\n\t\n\t\t\nBase Model\nOurs\n\n\n\t\t\nMini-InternVL-Chat-2B-V1-5\nMathCoder-VL-2B\n\n\nInternVL2-8B… See the full description on the dataset page: https://huggingface.co/datasets/MathLLMs/ImgCode-8.6M.","downloads":2912,"tags":["task_categories:image-to-text","task_categories:text-generation","task_categories:image-text-to-text","task_categories:visual-question-answering","language:en","license:apache-2.0","size_categories:1M<n<10M","format:parquet","modality:image","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2505.10557","region:us","Geometry","Diagrams","Charts","Tables","Graphs","Figures","Plots","Math"],"createdAt":"2025-07-03T15:11:14.000Z","key":""},{"_id":"6878963273bedf813f4fef37","id":"spatialverse/InteriorGS","author":"spatialverse","disabled":false,"gated":"auto","lastModified":"2026-01-18T14:34:28.000Z","likes":178,"trendingScore":2,"private":false,"sha":"5201ed9fd11fc2b8ac23e069796c386dbbf8f943","description":"\n\t\n\t\t\n\t\tInteriorGS: 3D Gaussian Splatting Dataset of Semantically Labeled Indoor Scenes\n\t\n\nA comprehensive indoor scene dataset featuring 3D Gaussian representations with semantic annotations and spatial occupancy information.\n\n    \n    \n\n\n\n  Sample from the InteriorGS dataset. The dataset provides high-quality 3D Gaussian Splatting (3DGS) representations along with instance-level semantic bounding boxes and occupancy maps indicating agent-accessible areas. 
The red and yellow trajectories… See the full description on the dataset page: https://huggingface.co/datasets/spatialverse/InteriorGS.","downloads":5665,"tags":["license:other","arxiv:2510.21307","region:us"],"createdAt":"2025-07-17T06:20:34.000Z","key":""},{"_id":"68873481d4a41fe542ba35b7","id":"uv-scripts/ocr","author":"uv-scripts","disabled":false,"gated":false,"lastModified":"2026-06-23T15:20:22.000Z","likes":141,"trendingScore":2,"private":false,"sha":"3a5d7edf47026b04e3d34b8e4c19f0e0f9f64a00","description":"\n\t\n\t\t\n\t\n\t\n\t\tOCR UV Scripts\n\t\n\n\n\nPart of uv-scripts — self-contained UV scripts you run on Hugging Face Jobs in one command.\n\nA model zoo of OCR scripts — one per model — that add a markdown column to an image dataset. Pick a model from the table below, point it at your dataset, and run it on a GPU with one command. A few recipes do structured extraction instead — image or text → JSON given a schema (see Structured extraction below). Two more companions sit alongside: pp-doclayout.py detects… See the full description on the dataset page: https://huggingface.co/datasets/uv-scripts/ocr.","downloads":1438,"tags":["arxiv:2605.27978","region:us","uv-script","ocr","extraction","vision-language-model","document-processing","hf-jobs"],"createdAt":"2025-07-28T08:27:45.000Z","key":""},{"_id":"688d3ee280216481fe4acef7","id":"universal-dependencies/universal_dependencies","author":"universal-dependencies","disabled":false,"gated":false,"lastModified":"2026-06-22T09:54:56.000Z","likes":6,"trendingScore":2,"private":false,"sha":"ab26265449dfaed5356ecb4ddf021816dfcf1b34","description":"\n\t\n\t\t\n\t\n\t\n\t\tDataset Card (v2.0) for Universal Dependencies Treebank\n\t\n\nVersion 2.0.0 introduces significant improvements and breaking changes:\n\nParquet Format: faster loading with HuggingFace datasets >=4.0.0\nMWT Support: New mwt field provides structured multi-word token information\nEnhanced Security: No more trust_remote_code=True 
required\nSeparate Versioning: Loader version (2.0.0) distinct from UD data version (2.17)\n\nBreaking Changes:\n\nToken sequences now exclude MWT surface forms… See the full description on the dataset page: https://huggingface.co/datasets/universal-dependencies/universal_dependencies.","downloads":10446,"paperswithcode_id":"universal-dependencies","tags":["task_categories:token-classification","task_ids:parsing","task_ids:part-of-speech","task_ids:lemmatization","annotations_creators:expert-generated","language_creators:crowdsourced","multilinguality:multilingual","source_datasets:original","language:abq","language:ab","language:af","language:akk","language:aqz","language:sq","language:gsw","language:am","language:grc","language:hbo","language:apu","language:ar","language:hy","language:as","language:aii","language:az","language:bm","language:eu","language:bar","language:bej","language:be","language:bn","language:bho","language:sab","language:bor","language:brh","language:br","language:bg","language:bxr","language:yue","language:cpg","language:ca","language:ceb","language:ckb","language:zh","language:ctn","language:ckt","language:xcl","language:lzh","language:cop","language:hr","language:cs","language:da","language:nl","language:egy","language:en","language:myv","language:eo","language:et","language:fo","language:fi","language:fr","language:qfn","language:gl","language:ka","language:de","language:aln","language:gor","language:got","language:el","language:gub","language:gn","language:gu","language:gwi","language:ht","language:ha","language:he","language:azz","language:hi","language:hit","language:hu","language:is","language:arh","language:id","language:ga","language:it","language:ja","language:jv","language:urb","language:kbc","language:xnr","language:krl","language:arr","language:kk","language:naq","language:quc","language:koi","language:kpv","language:ko","language:ky","language:ltg","language:la","language:lv","language:lij","language:lt","language:olo","language:
nds","language:lb","language:mk","language:jaa","language:qaf","language:mpu","language:ml","language:mt","language:gv","language:mr","language:gun","language:axm","language:frm","language:mdf","language:myu","language:nmf","language:pcm","language:nap","language:yrk","language:ne","language:yrl","language:sme","language:kmr","language:gya","language:no","language:oc","language:or","language:cu","language:orv","language:ang","language:fro","language:oge","language:sga","language:pro","language:otk","language:ota","language:ps","language:pad","language:fa","language:pay","language:xpg","language:pl","language:qpm","language:pt","language:pa","language:ro","language:ru","language:ruc","language:sa","language:gd","language:sr","language:wuu","language:scn","language:sd","language:si","language:sms","language:sk","language:sl","language:ajp","language:sdh","language:es","language:ssp","language:sv","language:swl","language:tl","language:ta","language:tt","language:eme","language:te","language:qte","language:th","language:tn","language:tpn","language:tr","language:qti","language:qtd","language:uk","language:xum","language:hsb","language:ur","language:ug","language:uz","language:vep","language:vi","language:wbp","language:cy","language:hyw","language:nhi","language:wo","language:xav","language:sjo","language:sah","language:yi","language:yo","language:ess","language:say","language:zza","license:other","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2004.10643","region:us","text","constituency-parsing","dependency-parsing","part-of-speech-tagging"],"createdAt":"2025-08-01T22:25:38.000Z","key":""},{"_id":"68909c14982d3238ea39c35c","id":"xlangai/ubuntu_osworld_verified_trajs","author":"xlangai","disabled":false,"gated":false,"lastModified":"2026-06-10T09:21:41.000Z","likes":17,"trendingScore":2,"private":false,"sha":"5124c991419d4c40deebab2c317735302c1cbdc0","description":"\n\t\n\t\t
\n\t\n\t\n\t\tOSWorld-Verified Model Trajectories\n\t\n\nThis repository contains trajectory results from various AI models evaluated on the OSWorld benchmark - a comprehensive evaluation environment for multimodal agents in real computer environments.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Overview\n\t\n\nThis dataset includes evaluation trajectories and results from multiple state-of-the-art models tested on OSWorld tasks.\n\n\t\n\t\t\n\t\n\t\n\t\tFile Structure\n\t\n\nEach zip file contains complete evaluation trajectories including:… See the full description on the dataset page: https://huggingface.co/datasets/xlangai/ubuntu_osworld_verified_trajs.","downloads":1987,"tags":["license:mit","size_categories:100K<n<1M","region:us","code"],"createdAt":"2025-08-04T11:40:04.000Z","key":""},{"_id":"6891391f92f9e1788632d0ad","id":"KingOfThoughtFleuren/Aetherius-Genesis-Library","author":"KingOfThoughtFleuren","disabled":false,"gated":false,"lastModified":"2025-08-05T02:18:50.000Z","likes":2,"trendingScore":2,"private":false,"sha":"a0c1df1ceb5cdb890dfea9fc12eefe4a8647c5ea","downloads":34,"tags":["license:mit","size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2025-08-04T22:50:07.000Z","key":""},{"_id":"6891ea33da98d151ae6c8d3c","id":"MCAA1-MSU/anv_data_ke","author":"MCAA1-MSU","disabled":false,"gated":"manual","lastModified":"2026-02-14T14:36:27.000Z","likes":14,"trendingScore":2,"private":false,"sha":"8267474a101db87360c3de9ca4117dcba4706b6d","description":"language:\n\nki\nso\nkln\nluo\nmas\n\n\n\t\n\t\t\n\t\tpretty_name: anv_ke\n\t\n\n⚠️ IMPORTANT: Work in ProgressThis dataset is not final. Updates will continue through September 2025.Please use the latest version for attribution, benchmarking and publications.\n\n\t\n\t\t\n\t\tOverview\n\t\n\nAfrican Next Voices: Pilot Data Collection in Kenya is part of a larger initiative to support African language speech technology. 
This project, funded by the Gates Foundation, is led by the KenCorpus Consortium, a coalition of Kenyan… See the full description on the dataset page: https://huggingface.co/datasets/MCAA1-MSU/anv_data_ke.","downloads":782,"tags":["size_categories:100K<n<1M","modality:audio","modality:text","region:us"],"createdAt":"2025-08-05T11:25:39.000Z","key":""},{"_id":"689cca62d870fb1a8441783b","id":"nvidia/Nemotron-Post-Training-Dataset-v2","author":"nvidia","disabled":false,"gated":"auto","lastModified":"2025-08-21T04:29:18.000Z","likes":138,"trendingScore":2,"private":false,"sha":"5c89e01dd720ae0f4058445ed49c5fb68a03c76e","description":"\n\t\n\t\t\n\t\tNemotron-Post-Training-Dataset-v2 Release\n\t\n\n\n\t\n\t\t\n\t\tData Overview\n\t\n\nThis dataset adds to NVIDIA’s post-training dataset releases with an extension of SFT and RL data into five target languages: Spanish, French, German, Italian and Japanese. The data supports improvements of math, code, general reasoning, and instruction following capabilities of the NVIDIA-Nemotron-Nano-9B-v2-Base, in support of release of NVIDIA-Nemotron-Nano-8B-v2-Reasoning.\nNVIDIA-Nemotron-Nano-9B is a family of… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2.","downloads":7813,"tags":["language:en","language:de","language:it","language:fr","language:es","language:ja","license:cc-by-4.0","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2508.14444","region:us"],"createdAt":"2025-08-13T17:24:50.000Z","key":""},{"_id":"689d9866627d0b303e0171d0","id":"nvidia/Nemotron-CC-Math-v1","author":"nvidia","disabled":false,"gated":"auto","lastModified":"2025-12-23T00:17:16.000Z","likes":89,"trendingScore":2,"private":false,"sha":"397a2502f2028c659ba411a6c4935b464a7f03aa","description":"\n\t\n\t\t\n\t\tNemotron-Pre-Training-Dataset-v1 Release\n\t\n\n👩‍💻 Authors: Rabeeh Karimi Mahabadi, 
Sanjeev Satheesh\n📘 Paper: Nemotron-cc-math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset\n📝 Blog: Nemotron-cc-math blog\n\n\t\n\t\t\n\t\tData Overview\n\t\n\nWe’re excited to introduce Nemotron-CC-Math - a large-scale, high-quality math corpus extracted from Common Crawl which was used in nemotron pre-training. \nThis dataset is built to preserve and surface high-value mathematical and code content… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-CC-Math-v1.","downloads":60652,"tags":["task_categories:text-generation","license:other","size_categories:100M<n<1B","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2508.15096","arxiv:2410.12881","arxiv:2508.14444","region:us"],"createdAt":"2025-08-14T08:03:50.000Z","key":""},{"_id":"68a33a7797764e75205988d5","id":"lightonai/embeddings-pre-training","author":"lightonai","disabled":false,"gated":false,"lastModified":"2026-04-16T13:05:20.000Z","likes":48,"trendingScore":2,"private":false,"sha":"109b885788eb67398ae47c30841273d4dc692066","description":"\n\t\n\t\t\n\t\tOverview\n\t\n\nThis large-scale dataset is designed for pre-training state-of-the-art text embedding models. Its goal is to reproduce and build upon the data recipe described in the mGTE technical report (Zhang et al., 2024), which details the data sources used to train the GTE family of embedding models but does not release the data itself.\nWe assembled this dataset as part of a research effort to understand how data composition affects retrieval model quality. 
Our experiments confirmed… See the full description on the dataset page: https://huggingface.co/datasets/lightonai/embeddings-pre-training.","downloads":3589,"tags":["size_categories:1B<n<10B","modality:tabular","modality:text","arxiv:2407.19669","region:eu"],"createdAt":"2025-08-18T14:36:39.000Z","key":""},{"_id":"68a7fb1964bbb2ec688d1af2","id":"behavior-1k/2025-challenge-demos","author":"behavior-1k","disabled":false,"gated":false,"lastModified":"2025-12-02T06:22:23.000Z","likes":36,"trendingScore":2,"private":false,"sha":"33639692425296185f245aaa648530178c7b3d2b","description":"This dataset was created using LeRobot.\n\n\t\n\t\t\n\t\tDataset Structure\n\t\n\nmeta/info.json:\n{\n    \"codebase_version\": \"v2.1\",\n    \"robot_type\": \"R1Pro\",\n    \"total_episodes\": 10000,\n    \"total_frames\": 119094660,\n    \"total_tasks\": 50,\n    \"total_videos\": 90000,\n    \"chunks_size\": 10000,\n    \"fps\": 30,\n    \"splits\": {\n        \"train\": \"0:10000\"\n    },\n    \"data_path\": \"data/task-{episode_chunk:04d}/episode_{episode_index:08d}.parquet\",\n    \"video_path\":… See the full description on the dataset page: 
https://huggingface.co/datasets/behavior-1k/2025-challenge-demos.","downloads":72041,"tags":["task_categories:robotics","license:mit","modality:video","arxiv:2403.09227","doi:10.57967/hf/6394","region:us","LeRobot","v","2",".","1"],"createdAt":"2025-08-22T05:07:37.000Z","key":""},{"_id":"68aefc94ff689607989927fc","id":"codersan/Mizan","author":"codersan","disabled":false,"gated":false,"lastModified":"2025-08-31T12:24:27.000Z","likes":2,"trendingScore":2,"private":false,"sha":"79f32c87c23cf8ae4666cf08d85de7a1a71f4d64","downloads":22,"tags":["size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2025-08-27T12:39:48.000Z","key":""},{"_id":"68b6eaf528cb5c633db258a7","id":"paulgavrikov/visualoverload","author":"paulgavrikov","disabled":false,"gated":false,"lastModified":"2026-04-15T14:04:26.000Z","likes":10,"trendingScore":2,"private":false,"sha":"93d7fbab3f4b2f759ff388022dc18c7a8fd96f80","description":"\n\t\n\t\t\n\t\tVisualOverload (CVPR 2026)\n\t\n\n\n\n\n\n\n[📚 Paper] \n[💻 Code]\n[🌐 Project Page]\n[🏆 Leaderboard]\n[🎯 Online Evaluator]\n\n\nIs basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question–answer pairs, with privately held ground-truth responses. 
Unlike prior VQA datasets that typically focus on near global image understanding, VisualOverload challenges models to perform… See the full description on the dataset page: https://huggingface.co/datasets/paulgavrikov/visualoverload.","downloads":554,"tags":["task_categories:visual-question-answering","task_categories:image-text-to-text","language:en","license:cc-by-sa-4.0","size_categories:1K<n<10K","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2509.25339","region:us","art"],"createdAt":"2025-09-02T13:02:45.000Z","key":""},{"_id":"68baa3daeb995f8daef25979","id":"HuggingFaceVLA/libero","author":"HuggingFaceVLA","disabled":false,"gated":false,"lastModified":"2025-09-30T09:45:51.000Z","likes":63,"trendingScore":2,"private":false,"sha":"86958911c0f959db2bbbdb107eb3e17c5f9c798e","description":"This dataset was created using LeRobot.\n\n\t\n\t\t\n\t\tDataset Structure\n\t\n\nmeta/info.json:\n{\n    \"codebase_version\": \"v3.0\",\n    \"robot_type\": \"panda\",\n    \"total_episodes\": 1693,\n    \"total_frames\": 273465,\n    \"total_tasks\": 40,\n    \"chunks_size\": 1000,\n    \"fps\": 10.0,\n    \"splits\": {\n        \"train\": \"0:1693\"\n    },\n    \"data_path\": \"data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet\",\n    \"video_path\": \"videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4\"… See the full description on the dataset page: 
https://huggingface.co/datasets/HuggingFaceVLA/libero.","downloads":25754,"tags":["task_categories:robotics","license:apache-2.0","size_categories:100K<n<1M","format:parquet","modality:image","modality:timeseries","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","LeRobot"],"createdAt":"2025-09-05T08:48:26.000Z","key":""},{"_id":"68bf8e31a571e17bf5e3f7dd","id":"m-a-p/DeepWriting-20K","author":"m-a-p","disabled":false,"gated":false,"lastModified":"2025-09-09T02:20:00.000Z","likes":35,"trendingScore":2,"private":false,"sha":"45ff2a40438236dd442b0bfe02add96115184e1d","downloads":64,"tags":["size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2025-09-09T02:17:21.000Z","key":""},{"_id":"68d267c18a7c72e4280caa21","id":"MathLLMs/VoiceAssistant-Eval","author":"MathLLMs","disabled":false,"gated":false,"lastModified":"2025-10-21T08:13:38.000Z","likes":12,"trendingScore":2,"private":false,"sha":"5b6260bc6fe2f90cb857befca73ac9916399da5a","description":"\n\t\n\t\t\n\t\t🔥 VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing\n\t\n\n \n \n \n \n\n\n\n\n\n\n\n[🌐 Homepage] \n[🔮 Visualization]\n[💻 Github] \n[📖 Paper]\n[📊 Leaderboard ] \n[📊 Detailed Leaderboard ] \n[📊 Roleplay Leaderboard ] \n\n\n\n\n\n\t\n\t\t\n\t\t🚀 Data Usage\n\t\n\nfrom datasets import load_dataset\n\nfor split in ['listening_general', 'listening_music', 'listening_sound', 'listening_speech', \n'speaking_assistant', 'speaking_emotion', 'speaking_instruction_following'… See the full description on the dataset page: 
https://huggingface.co/datasets/MathLLMs/VoiceAssistant-Eval.","downloads":17978,"tags":["task_categories:question-answering","task_categories:visual-question-answering","task_categories:audio-to-audio","task_categories:any-to-any","task_categories:multiple-choice","task_categories:text-generation","license:mit","size_categories:10K<n<100K","format:parquet","modality:text","modality:audio","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2509.22651","region:us","audio","multimodal","listening","speaking","viewing","question-answering","audio-understanding","reasoning","instruction-following","roleplay","safety","emotion","robustness"],"createdAt":"2025-09-23T09:26:25.000Z","key":""},{"_id":"68d9e9c16d8ec4acf7ee2661","id":"Fortytwo-Network/Strandset-Rust-v1","author":"Fortytwo-Network","disabled":false,"gated":false,"lastModified":"2026-01-05T17:18:07.000Z","likes":38,"trendingScore":2,"private":false,"sha":"0a8d223302712a2b34a6ad4ce1fd679031894b3d","description":"\n\t\n\t\t\n\t\tStrandset-Rust-v1\n\t\n\n\n\t\n\t\t\n\t\tOverview\n\t\n\nStrandset-Rust-v1 is a large, high-quality synthetic dataset built to advance code modeling for the Rust programming language. Generated and validated through Fortytwo’s Swarm Inference, it contains 191,008 verified examples across 15 task categories, spanning code generation, bug detection, refactoring, optimization, documentation, and testing.\nRust’s unique ownership and borrowing system makes it one of the most challenging languages for… See the full description on the dataset page: 
https://huggingface.co/datasets/Fortytwo-Network/Strandset-Rust-v1.","downloads":267,"tags":["license:apache-2.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2510.24801","arxiv:2409.08386","region:us","code"],"createdAt":"2025-09-29T02:06:57.000Z","key":""},{"_id":"68edbd45390870c8fb777cc0","id":"jackluoluo/ArchCAD","author":"jackluoluo","disabled":false,"gated":"manual","lastModified":"2025-10-17T14:57:00.000Z","likes":52,"trendingScore":2,"private":false,"sha":"035db66d2c5794395f63d793324b3fe3d360ffe3","description":"\n\t\n\t\t\n\t\t🏗️ ArchCAD\n\t\n\n🇺🇸 English | 🇨🇳 中文说明\n\n \n     \n\n\nA Multimodal CAD Dataset for Vectorized Drawing Understanding\n\n40k Samples · 5 Strictly Aligned Modalities · Foundational Data for AI Understanding of Engineering Drawings\n\n\n\n\t\t\n\t\t📑 Table of Contents\n\t\n\n\nWhat is ArchCAD?\nKey Features\nDataset Structure\nData Modalities\nAnnotations\n\n\nBaseline Model: DPSS\nPotential Applications\nCitation\n\n\n\n\t\n\t\t\n\t\n\t\n\t\t📘 What is ArchCAD?\n\t\n\nAI systems have long struggled to interpret and utilize CAD… See the full description on the dataset page: https://huggingface.co/datasets/jackluoluo/ArchCAD.","downloads":299,"tags":["task_categories:visual-question-answering","task_categories:image-to-text","license:cc-by-nc-4.0","arxiv:2503.22346","region:us"],"createdAt":"2025-10-14T03:02:29.000Z","key":""},{"_id":"68f14a9cd770dfae97f0dd2c","id":"ai4ce/wanderland","author":"ai4ce","disabled":false,"gated":"manual","lastModified":"2026-06-17T18:24:21.000Z","likes":8,"trendingScore":2,"private":false,"sha":"278054cdfd1f294da91aa3088037760ed3705522","description":"\n\t\n\t\t\n\t\n\t\n\t\tWanderland Dataset\n\t\n\n\n\n\n\n\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Description\n\t\n\nWanderland is a large-scale urban dataset designed for geometrically grounded simulation and open-world embodied AI research. 
The dataset contains diverse urban scenes captured with dual fisheye cameras, providing high-quality data for 3D reconstruction, novel view synthesis, and navigation tasks.\n\n\t\n\t\t\n\t\n\t\n\t\tKey Features\n\t\n\n\nUrban Scenes: Diverse outdoor environments with varying complexity\nMulti-Modal Data: RGB… See the full description on the dataset page: https://huggingface.co/datasets/ai4ce/wanderland.","downloads":1914,"tags":["task_categories:robotics","license:apache-2.0","size_categories:100K<n<1M","format:text","modality:3d","modality:text","library:datasets","library:mlcroissant","arxiv:2511.20620","region:us","3d-reconstruction","novel-view-synthesis","embodied-ai","navigation","urban-scenes","gaussian-splatting","colmap"],"createdAt":"2025-10-16T19:42:20.000Z","key":""},{"_id":"68f2a7bb4328e4e10a4e263e","id":"allenai/olmoearth_pretrain_dataset","author":"allenai","disabled":false,"gated":false,"lastModified":"2025-11-01T04:25:39.000Z","likes":17,"trendingScore":2,"private":false,"sha":"27a2be0d820196aff3b700a94c0874d780da3caf","description":"This is the pre-training dataset for training the OlmoEarth pre-trained remote sensing foundation models.\nDocumentation is on GitHub at https://github.com/allenai/olmoearth_pretrain/blob/main/docs/Pretraining-Dataset.md\nThe dataset is released under CC BY 4.0. 
It includes data from the following sources:\n\nSentinel-2 L2A imagery from the European Space Agency, available under the Copernicus Sentinel Data and Service Legal Notice\nSentinel-1 GRD IW vv+vh imagery from the European Space Agency… See the full description on the dataset page: https://huggingface.co/datasets/allenai/olmoearth_pretrain_dataset.","downloads":3694,"tags":["license:cc-by-4.0","size_categories:100M<n<1B","format:webdataset","modality:image","modality:text","library:datasets","library:webdataset","library:mlcroissant","region:us"],"createdAt":"2025-10-17T20:31:55.000Z","key":""},{"_id":"68fee6b947e8b83fe46d1bb1","id":"pnnbao-ump/VieNeu-TTS-140h","author":"pnnbao-ump","disabled":false,"gated":"auto","lastModified":"2026-05-31T05:55:32.000Z","likes":30,"trendingScore":2,"private":false,"sha":"a5f8845053018f68467d45d5804b83711c7c1a01","description":"\n\t\n\t\t\n\t\n\t\n\t\tpnnbao-ump/VieNeu-TTS-140h\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Description\n\t\n\nA high-quality Vietnamese Text-to-Speech (TTS) dataset containing 74,858 audio samples with phonemized transcripts. This benchmark dataset is designed for fine-tuning modern TTS models with maximum synthesis quality. 
The text corpus is completely phonemized using standard international phonetic alphabet (IPA) representations suitable for neural acoustic modeling.\n\n\t\n\t\t\n\t\n\t\n\t\tQuick Facts\n\t\n\n\nLanguage: Vietnamese 🇻🇳\nTasks:… See the full description on the dataset page: https://huggingface.co/datasets/pnnbao-ump/VieNeu-TTS-140h.","downloads":595,"tags":["task_categories:text-to-speech","task_categories:automatic-speech-recognition","language:vi","license:apache-2.0","size_categories:10K<n<100K","format:arrow","modality:audio","modality:text","library:datasets","library:mlcroissant","doi:10.57967/hf/7428","region:us","vietnamese","tts","speech","phonemized","multi-speaker"],"createdAt":"2025-10-27T03:27:53.000Z","key":""},{"_id":"6901f3d61e56d5c452b8dd24","id":"meituan-longcat/AMO-Bench","author":"meituan-longcat","disabled":false,"gated":false,"lastModified":"2026-02-05T08:43:07.000Z","likes":36,"trendingScore":2,"private":false,"sha":"2f422616c25d862984408fbbfaed63a961e8e025","description":"\n\t\n\t\t\n\t\t📐 AMO-Bench: Large Language Models Still Struggle in High School Math Competitions\n\t\n\n\n📄 Paper\n🌐 Project Page\n💻 Github Repo\n\n\n\t\n\t\t\n\t\tUpdates\n\t\n\n\n2026.02.05: Leaderboard Update: Qwen3-Max-Thinking achieves a new SOTA with 65.1%, while GLM-4.7 sets a new open-source record at 62.4%! \n2025.12.01: We have added Token Efficiency showing the number of output tokens used by models in the leaderboard. 
Gemini 3 Pro achieves the highest token efficiency among top-performance models!… See the full description on the dataset page: https://huggingface.co/datasets/meituan-longcat/AMO-Bench.","downloads":2634,"tags":["task_categories:question-answering","language:en","license:mit","size_categories:n<1K","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2510.26768","arxiv:2508.19413","arxiv:2508.03927","region:us","Large Language Models","Reasoning Models","Mathematical Reasoning","mathematics"],"createdAt":"2025-10-29T11:00:38.000Z","key":""},{"_id":"69109b3f0570b41ffe28a6ea","id":"builddotai/Egocentric-10K","author":"builddotai","disabled":false,"gated":"auto","lastModified":"2026-02-16T05:58:56.000Z","likes":334,"trendingScore":2,"private":false,"sha":"3e5f87c88c54ce8343865d8e2a8c171f18385a05","description":"\nEgocentric-10K is the largest egocentric dataset. It is the first dataset collected exclusively in real factories.\n\n  \n  Your browser does not support the video tag.\n\n\n\nEgocentric-10K is state-of-the-art in hand visibility and active manipulation density compared to previous in-the-wild egocentric datasets. 
The complete 30,000 frame evaluation set is available at Egocentric-10K-Evaluation.\n\n\n\t\n\t\t\n\t\tDataset Statistics\n\t\n\n\n\t\n\t\t\nAttribute\nValue\n\n\n\t\t\nTotal Hours\n10,000\n\n\nTotal Frames\n1.08 billion… See the full description on the dataset page: https://huggingface.co/datasets/builddotai/Egocentric-10K.","downloads":111475,"tags":["license:apache-2.0","region:us"],"createdAt":"2025-11-09T13:46:39.000Z","key":""},{"_id":"6928ac839f54f92be8b78d70","id":"TeichAI/claude-4.5-opus-high-reasoning-250x","author":"TeichAI","disabled":false,"gated":false,"lastModified":"2025-11-28T03:02:41.000Z","likes":398,"trendingScore":2,"private":false,"sha":"742c86f88b66bf53cb5961a25e4360f5582f4a6e","description":"This is a reasoning dataset created using Claude Opus 4.5 with a reasoning depth set to high. Some of these questions are from reedmayhew and the rest were generated.\nThe dataset is meant for creating distilled versions of Claude Opus 4.5 by fine-tuning already existing open-source LLMs.\n\n\t\n\t\t\n\t\tStats\n\t\n\n\nCosts: $ 52.3 (USD)\nTotal tokens (input + output): 2.13 M\n\n","downloads":939,"tags":["size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2025-11-27T19:54:43.000Z","key":""},{"_id":"692ef4d45387ef3ba08f1414","id":"amd/ReasonLite-Dataset","author":"amd","disabled":false,"gated":false,"lastModified":"2026-01-22T06:24:37.000Z","likes":15,"trendingScore":2,"private":false,"sha":"a817d32938905e15d1ec6e16f8084afe15612f11","description":"\n  \n\n\n\nGitHub |\nDataset |\nBlog \n\n\n\n\nReasonLite is an ultra-lightweight math reasoning model. 
With only 0.6B parameters, it leverages high-quality data distillation to achieve performance comparable to models over 10× its size, such as Qwen3-8B, reaching 75.2 on AIME24 and extending the scaling law of small models.\n\n🔥 Best-performing 0.6B math reasoning model\n🔓 Fully open-source — weights, scripts, datasets, synthesis pipeline⚙️ Distilled in two stages to balance efficiency and high… See the full description on the dataset page: https://huggingface.co/datasets/amd/ReasonLite-Dataset.","downloads":404,"tags":["license:openrail","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2025-12-02T14:16:52.000Z","key":""},{"_id":"6930946cf0df0882f6074e51","id":"deepreinforce-ai/CUDA-L2","author":"deepreinforce-ai","disabled":false,"gated":false,"lastModified":"2025-12-03T19:51:10.000Z","likes":3,"trendingScore":2,"private":false,"sha":"d8c15e5880137174efa510647b7858b81dbcaf2a","description":"\n\n  \n      \n  \n\n\n\n\nCUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning\n\n\n\n\n\n\t\n\t\t\n\t\t🥳 Introduction\n\t\n\nCUDA-L2 is a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. 
CUDA-L2 systematically outperforms major matmul baselines to date, from the widely-used torch.matmul to state-of-the-art NVIDIA closed-source libraries (cuBLAS… See the full description on the dataset page: https://huggingface.co/datasets/deepreinforce-ai/CUDA-L2.","downloads":561,"tags":["size_categories:1K<n<10K","format:csv","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2512.02551","region:us"],"createdAt":"2025-12-03T19:50:04.000Z","key":""},{"_id":"6931046ceb0ce6e2002af81b","id":"builddotai/Egocentric-100K","author":"builddotai","disabled":false,"gated":"auto","lastModified":"2026-02-16T05:56:56.000Z","likes":126,"trendingScore":2,"private":false,"sha":"fae604b751b25337d6fd8c4c53e595910c28f68f","description":"\nEgocentric-100K is the largest dataset of manual labor. You can visualize the dataset here.\n\nEgocentric-100K is state-of-the-art in hand visibility and active manipulation density compared to previous in-the-wild egocentric datasets. 
The complete 30,000 frame evaluation set is available at Egocentric-100K-Evaluation.\n\n\n\t\n\t\t\n\t\tDataset Statistics\n\t\n\n\n\t\n\t\t\nAttribute\nValue\n\n\n\t\t\nTotal Hours\n100,405\n\n\nTotal Frames\n10.8 billion\n\n\nVideo Clips\n2,010,759\n\n\nMedian Clip Length\n180.0 seconds\n\n\nMean Hours… See the full description on the dataset page: https://huggingface.co/datasets/builddotai/Egocentric-100K.","downloads":47735,"tags":["license:apache-2.0","size_categories:1M<n<10M","format:webdataset","modality:text","modality:video","library:datasets","library:webdataset","library:mlcroissant","region:us","video","egocentric","robotics"],"createdAt":"2025-12-04T03:47:56.000Z","key":""},{"_id":"69335fc72b818a1f0d013464","id":"open-thoughts/OpenThoughts-Agent-v1-SFT","author":"open-thoughts","disabled":false,"gated":false,"lastModified":"2026-01-27T22:59:08.000Z","likes":97,"trendingScore":2,"private":false,"sha":"c5dc896981f4e3b7c5382669b1d1be0bc4b6a1a6","description":"\n    \n\n\n\nProject |\nSFT dataset |\nRL dataset |\nSFT model |\nRL model\n\n\n\n\n\t\n\t\n\t\n\t\tOpenThinker-Agent-v1-SFT\n\t\n\nOpenThoughts-Agent is an open-source effort to curate the best datasets for training agents. 
Our first release includes datasets, models and our research codebase.\nOpenThinker-Agent-v1 is a model trained for agentic tasks such as Terminal-Bench 2.0 and SWE-Bench.\nThe OpenThinker-Agent-v1 model is post-trained from Qwen/Qwen3-8B.\nIt is SFT-ed on the OpenThoughts-Agent-v1-SFT dataset, then… See the full description on the dataset page: https://huggingface.co/datasets/open-thoughts/OpenThoughts-Agent-v1-SFT.","downloads":6965,"tags":["license:apache-2.0","size_categories:10K<n<100K","format:parquet","format:optimized-parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","agents","terminal","code","software-engineering"],"createdAt":"2025-12-05T22:42:15.000Z","key":""},{"_id":"693c6c0dc9d7af74f700ef72","id":"nvidia/Nemotron-CC-v2.1","author":"nvidia","disabled":false,"gated":"manual","lastModified":"2025-12-22T17:09:52.000Z","likes":131,"trendingScore":2,"private":false,"sha":"ba6f2aaef7ada865bb08fc08640ca292150097db","description":"\n\t\n\t\t\n\t\tNemotron-Pre-Training-Dataset-v2.1\n\t\n\n\n\t\n\t\t\n\t\tDataset  Description\n\t\n\nThe  Nemotron-Pre-Training-Dataset-v2.1  extends  the  previously  released  Nemotron  pretraining  datasets  with  refreshed,  higher-quality,  and  more  diverse  data  across  math,  code,  English  Common  Crawl,  and  large-scale  synthetic  corpora.  
Designed  for  the  NVIDIA  Nemotron  3  family  of  LLMs,  the  dataset  introduces  new  Common  Crawl  code  extraction,  2.5T  new  English  web  tokens… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-CC-v2.1.","downloads":5862,"tags":["task_categories:text-generation","license:other","size_categories:1B<n<10B","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2508.14444","arxiv:2508.15096","arxiv:2412.02595","arxiv:2505.02881","region:us"],"createdAt":"2025-12-12T19:25:01.000Z","key":""},{"_id":"6944d5ae5f44ccf243c7644e","id":"Beifengdo/gailvlunyushulitongji","author":"Beifengdo","disabled":false,"gated":false,"lastModified":"2026-06-24T01:30:46.000Z","likes":2,"trendingScore":2,"private":false,"sha":"bea559b676604516da04002bbb200367fad37d8e","description":"\n\t\n\t\t\n\t\n\t\n\t\t概率论与数理统计\n\t\n\n","downloads":433,"tags":["size_categories:n<1K","modality:document","modality:image","library:datasets","library:mlcroissant","region:us"],"createdAt":"2025-12-19T04:33:50.000Z","key":""},{"_id":"695138b5329f4825326ac6c8","id":"facebook/research-plan-gen","author":"facebook","disabled":false,"gated":false,"lastModified":"2026-01-02T14:56:26.000Z","likes":302,"trendingScore":2,"private":false,"sha":"8ae1ba08759afd32b20eb959ea6addf25c2c0929","description":"\n\t\n\t\t\n\t\tRPG Dataset\n\t\n\nResearch Plan Generation dataset with three subsets covering ML, Arxiv, and PubMed research papers. 
Each subset contains research tasks with evaluation rubrics and reference solutions.\n\n\t\n\t\t\n\t\tDataset Statistics\n\t\n\n\n\t\n\t\t\nSubset\nTrain\nTest\nTotal\n\n\n\t\t\nML\n6,872\n685\n7,557\n\n\nArxiv\n6,573\n1,496\n8,069\n\n\nPubmed\n6,423\n464\n6,887\n\n\nTotal\n19,868\n2,645\n22,513\n\n\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tLoading the Dataset\n\t\n\nfrom datasets import load_dataset\n\n# Load a specific subset\nml_data =… See the full description on the dataset page: https://huggingface.co/datasets/facebook/research-plan-gen.","downloads":468,"tags":["size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2512.23707","region:us"],"createdAt":"2025-12-28T14:03:33.000Z","key":""},{"_id":"696552c844f950f64be9b539","id":"openai/ih-challenge","author":"openai","disabled":false,"gated":false,"lastModified":"2026-01-12T21:17:39.000Z","likes":21,"trendingScore":2,"private":false,"sha":"056b7d94345dd4f8049da75bd70617d8928ac586","description":"\n\t\n\t\t\n\t\tIH-Challenge\n\t\n\nTraining dataset from our paper Large-Scale RLVR Improves Instruction Hierarchy on Frontier LLMs.\n\n\t\n\t\t\n\t\tWarning About Company Names\n\t\n\nTo avoid legal and reputational risk, we replaced all company names in the original dataset for either COMPETITOR_i or BRAND_i, with i ∈ ℕ. 
We recommend that you replace these placeholders with real company names before training on the dataset.\n\n\t\n\t\t\n\t\tData Schema\n\t\n\n\n\t\n\t\t\nField\nType\nDescription\n\n\n\t\t\nattacker_meta_problem\nstr\nGeneral… See the full description on the dataset page: https://huggingface.co/datasets/openai/ih-challenge.","downloads":403,"tags":["license:apache-2.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-01-12T20:00:08.000Z","key":""},{"_id":"696789567b115954f1c68ab0","id":"openbmb/UltraData-Math","author":"openbmb","disabled":false,"gated":false,"lastModified":"2026-04-15T12:28:05.000Z","likes":321,"trendingScore":2,"private":false,"sha":"fe10db8efd35597fd7fcff8ff576b5ec4ea5ff87","description":"\n\t\n\t\t\n\t\tUltraData-Math\n\t\n\n\n  \n\n\n\n🤗 Dataset | 💻 Source Code | 🇨🇳 中文 README\n\n\nUltraData-Math is a large-scale, high-quality mathematical pre-training dataset totaling 290B+ tokens across three progressive tiers—L1 (170.5B tokens web corpus), L2 (33.7B tokens quality-selected), and L3 (88B tokens multi-format refined)—designed to systematically enhance mathematical reasoning in LLMs. 
It has been applied to the mathematical pre-training of the MiniCPM Series models.\nIt was introduced in the… See the full description on the dataset page: https://huggingface.co/datasets/openbmb/UltraData-Math.","downloads":32176,"tags":["task_categories:text-generation","language:en","language:zh","license:apache-2.0","size_categories:100M<n<1B","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2602.09003","region:us","llm","pretraining","math","data-synthesis","data-filtering","high-quality","mathematical-reasoning"],"createdAt":"2026-01-14T12:17:26.000Z","key":""},{"_id":"696b2406e6c69ff4f49745f4","id":"sojuL/RubricHub_v1","author":"sojuL","disabled":false,"gated":false,"lastModified":"2026-02-03T03:09:05.000Z","likes":269,"trendingScore":2,"private":false,"sha":"3837d55971473a872e84879c88f708b8da3ec2ef","description":"\n\t\n\t\t\n\t\tRubricHub\n\t\n\n\n\n\n\n    \n\n\n    \n\n\n    \n\n\n    \n\n\n\n\n\nRubricHub is a large-scale (approximately 110K), multi-domain dataset that provides high-quality rubric-based supervision for open-ended generation tasks. 
It is constructed via an automated coarse-to-fine rubric generation framework, which integrates principle-guided synthesis, multi-model aggregation, and difficulty evolution to produce comprehensive and highly discriminative evaluation criteria, overcoming the supervision ceiling of… See the full description on the dataset page: https://huggingface.co/datasets/sojuL/RubricHub_v1.","downloads":864,"tags":["task_categories:text-generation","task_categories:reinforcement-learning","task_categories:question-answering","language:zh","language:en","license:apache-2.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2601.08430","region:us","medical","science","wirting","isntruction","chat","general"],"createdAt":"2026-01-17T05:54:14.000Z","key":""},{"_id":"6982f232635d65d58755f261","id":"MCG-NJU/Seeker-173K","author":"MCG-NJU","disabled":false,"gated":false,"lastModified":"2026-02-10T08:22:08.000Z","likes":5,"trendingScore":2,"private":false,"sha":"dce284fed402f59bc8642f011aa2fadd4728d736","downloads":282,"tags":["license:mit","region:us"],"createdAt":"2026-02-04T07:16:02.000Z","key":""},{"_id":"698b2c8b4c9e577aa3b1fa16","id":"nohurry/Opus-4.6-Reasoning-3000x-filtered","author":"nohurry","disabled":false,"gated":false,"lastModified":"2026-03-31T12:43:36.000Z","likes":624,"trendingScore":2,"private":false,"sha":"1cd388e9e1172066092a2b53e33dbdd3249b77bd","description":"\n[!WARNING] NOTICE: The original dataset has been updated with better filtering. 
Please use the original dataset, not this one.\n\nFiltered from: https://huggingface.co/datasets/crownelius/Opus-4.6-Reasoning-3000x\nThe original dataset has 979 refusals, I removed these in this version.\n","downloads":1780,"tags":["license:apache-2.0","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-02-10T13:03:07.000Z","key":""},{"_id":"698e4ad0913c4d1f4a64479a","id":"Crownelius/Opus-4.6-Reasoning-3300x","author":"Crownelius","disabled":false,"gated":false,"lastModified":"2026-05-09T11:52:43.000Z","likes":309,"trendingScore":2,"private":false,"sha":"f4f929af994e6be73f2aa69178b4673f60f03359","description":"\n\n\t\n\t\t\n\t\tOpus-4.6-Reasoning-3000x (Cleaned)\n\t\n\nThis dataset has been automatically cleaned to remove:\n\nEmpty or missing responses\nResponses shorter than 10 characters\nRefusal responses (\"problem is incomplete\", \"cannot solve\", etc.)\nResponses with no substantive content\nResponses that just echo the problem\n\n\n\t\n\t\t\n\t\tCleaning Report\n\t\n\n\nOriginal rows: 3,305\nClean rows: 2,160\nRemoved: 1,145 (34.6%)\nColumns: ['id', 'problem', 'thinking', 'solution', 'difficulty', 'category', 'timestamp', 'hash']… See the full description on the dataset page: https://huggingface.co/datasets/Crownelius/Opus-4.6-Reasoning-3300x.","downloads":894,"tags":["license:apache-2.0","size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-02-12T21:49:04.000Z","key":""},{"_id":"698eb70ba75fc5fa76e7c3c1","id":"tensorxt/ViMedCSS","author":"tensorxt","disabled":false,"gated":false,"lastModified":"2026-02-20T03:33:18.000Z","likes":15,"trendingScore":2,"private":false,"sha":"b6959a18a08739464733930a872e7125c03e6558","description":"\n\t\n\t\t\n\t\t🩺 ViMedCSS: A Vietnamese Medical Code-Switching Speech Dataset (LREC 
2026)\n\t\n\n\n\t\n\t\t\n\t\t📖 Overview\n\t\n\nViMedCSS is a Vietnamese medical speech dataset for code-switching ASR, where each utterance contains at least one non-Vietnamese (mainly English) medical term embedded in Vietnamese speech.\n\n\t\n\t\t\n\t\t📊 Dataset Statistics\n\t\n\n\n\t\n\t\t\n\t\tSplit Statistics (from ViMedCSS-Metadata)\n\t\n\n\n\t\n\t\t\nSplit\n# Rows\nDuration (hours)\nAvg duration (s)\nTotal CS terms\n\n\n\t\t\ntrain\n11,832\n24.30\n7.39\n12,314… See the full description on the dataset page: https://huggingface.co/datasets/tensorxt/ViMedCSS.","downloads":894,"tags":["task_categories:automatic-speech-recognition","language:vi","license:cc-by-4.0","size_categories:10K<n<100K","modality:audio","modality:text","arxiv:2602.12911","region:us","medical","code-switching"],"createdAt":"2026-02-13T05:30:51.000Z","key":""},{"_id":"699250f08be5bf8321aeb29e","id":"HuggingFaceFW/finephrase","author":"HuggingFaceFW","disabled":false,"gated":false,"lastModified":"2026-03-31T06:26:09.000Z","likes":131,"trendingScore":2,"private":false,"sha":"78cf4a5ed0099214979c094c963e699c19163838","description":"\n\t\n\t\t\n\t\tDataset Card for HuggingFaceFW/finephrase\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nSynthetic data generated by DataTrove:\n\nModel: HuggingFaceTB/SmolLM2-1.7B-Instruct (main)\nSource dataset: HuggingFaceFW/fineweb-edu, config sample-350BT, split train\nGeneration config: temperature=1.0, top_p=1.0, top_k=50, max_tokens=2048, model_max_context=8192\nSpeculative decoding: {\"method\":\"suffix\",\"num_speculative_tokens\":32}\nSystem prompt: None\nInput column: text\n\nPrompt families:\n\nfaq prompt\n\nRewrite the… See the full description on the dataset page: 
https://huggingface.co/datasets/HuggingFaceFW/finephrase.","downloads":404242,"tags":["task_categories:text-generation","task_ids:language-modeling","annotations_creators:machine-generated","language_creators:found","source_datasets:HuggingFaceFW/fineweb-edu/sample-350BT","language:en","license:odc-by","size_categories:1B<n<10B","modality:tabular","modality:text","region:us","SmolLM2-1.7B-Instruct","fineweb-edu","synthetic","datatrove"],"createdAt":"2026-02-15T23:04:16.000Z","key":""},{"_id":"69961b3afb4584a79827ed7f","id":"ASSISTments/FoundationalASSIST","author":"ASSISTments","disabled":false,"gated":"manual","lastModified":"2026-06-22T20:11:51.000Z","likes":35,"trendingScore":2,"private":false,"sha":"9dc404eef97d3d20f066d0f263f048fd91fab09f","description":"Access requests will be faster if you have a university or research-affiliated email associated with your Hugging Face account\nUnfortunate news: Please note that this project is primarily supported by federal grants from the US government. As such, we need to follow certain regulations. Sadly, one is that we cannot share data with researchers from countries designated as \"Countries of concern\". 
We hope to soon share this dataset with the many fabulous researchers from these countries, but as… See the full description on the dataset page: https://huggingface.co/datasets/ASSISTments/FoundationalASSIST.","downloads":166,"tags":["license:cc-by-nc-4.0","size_categories:1M<n<10M","format:csv","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2602.00070","region:us"],"createdAt":"2026-02-18T20:04:10.000Z","key":""},{"_id":"699946473ccabf2d24116f0f","id":"Roman1111111/gemini-3.1-pro-hard-high-reasoning","author":"Roman1111111","disabled":false,"gated":false,"lastModified":"2026-02-21T05:50:10.000Z","likes":56,"trendingScore":2,"private":false,"sha":"5b9be1b2b8087b748a8a36c4d47631722d3b3d8e","description":"\n\t\n\t\t\n\t\tDataset Card for Gemini-3.1-Pro-Ultra-Reasoning-5.6M\n\t\n\n\n\t\n\t\t\n\t\tDataset Details\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\nThis dataset represents the frontier of synthetic reasoning data, generated by Gemini 3.1 Pro (High Reasoning variant). While smaller in total token volume than its predecessors (5.6M tokens), this corpus prioritizes logical density and multi-step verification. \nThe move to the 3.1 architecture provides a measurable leap in \"System 2\" thinking. 
Unlike standard models… See the full description on the dataset page: https://huggingface.co/datasets/Roman1111111/gemini-3.1-pro-hard-high-reasoning.","downloads":266,"tags":["task_categories:question-answering","task_categories:text-generation","language:en","license:mit","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","code","finance","legal","agent","chemistry","physics","synthetic","gemini-3.1-pro","high-reasoning","expert-level"],"createdAt":"2026-02-21T05:44:39.000Z","key":""},{"_id":"69aebfc14dc143297f6fa3ff","id":"Parexel/clinical-trials-protocols","author":"Parexel","disabled":false,"gated":false,"lastModified":"2026-06-03T19:12:01.000Z","likes":3,"trendingScore":2,"private":false,"sha":"20643ff6af4ef8e8cb07dbf7d99564774727432f","downloads":1131,"tags":["size_categories:10K<n<100K","modality:document","library:datasets","library:mlcroissant","region:us"],"createdAt":"2026-03-09T12:40:33.000Z","key":""},{"_id":"69af01ae7aed45131a975ddb","id":"prodnull/prompt-injection-repo-dataset","author":"prodnull","disabled":false,"gated":"auto","lastModified":"2026-03-11T01:45:20.000Z","likes":6,"trendingScore":2,"private":false,"sha":"6c56b897ee4a328fb4f41f4b0d334f5d2db4482a","description":"\n\t\n\t\t\n\t\n\t\n\t\tPrompt Injection Repository File Dataset\n\t\n\nA labeled dataset for detecting prompt injection attacks in repository files — code, configs, READMEs, CI/CD workflows, and documentation that AI coding agents process as context.\n\n\t\n\t\t\n\t\n\t\n\t\tWhat This Is (and Isn't)\n\t\n\nThis dataset targets a specific threat: indirect prompt injection via repository content. 
When AI coding agents (Claude Code, Cursor, Copilot, Gemini CLI) clone a repo, every file becomes part of the agent's context.… See the full description on the dataset page: https://huggingface.co/datasets/prodnull/prompt-injection-repo-dataset.","downloads":108,"tags":["task_categories:text-classification","language:en","license:apache-2.0","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2509.22040","arxiv:2601.17548","arxiv:2602.10453","arxiv:2503.14281","arxiv:2602.14161","region:us","prompt-injection","security","code-security","ai-agent-security","repository-scanning"],"createdAt":"2026-03-09T17:21:50.000Z","key":""},{"_id":"69b1f8513fadc91fa275ece6","id":"GaryYang123/zh-meme-sft-8k","author":"GaryYang123","disabled":false,"gated":false,"lastModified":"2026-04-20T05:17:06.000Z","likes":82,"trendingScore":2,"private":false,"sha":"84838bb5b1b023325499012ba956c485bbf592b4","description":"\n\n\n\n\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tzh-meme-sft-8k\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\t📖 简介 | Introduction\n\t\n\nzh-meme-sft-8k 是一个高质量的中文互联网梗文化指令微调数据集。该数据集基于抖音、小红书、B站等平台的真实评论互动构建，经过多轮清洗、增强和格式化处理，专门用于训练能够理解和使用网络热梗、具备幽默感的对话模型。\n\n🎯 这个数据集是 Meme-Qwen-7B-Instruct 模型的训练数据，如果你想看微调后的效果，可以直接体验模型！\n\n这个数据集的特点是：\n\n🎯 真实来源：基于真实社交平台的用户互动，保留原本网络表达🔄 对话结构：包含帖子-评论、评论-回复的完整对话链\n🧹 精细清洗：经过多轮规则清洗和LLM增强，去除噪声的同时保留热梗\n💬 ChatML格式：标准化为ChatML格式，开箱即用\n\n\n\n\t\n\t\t\n\t\t📊 数据统计 | Data Statistics\n\t\n\n\n\t\n\t\t\n数据集\n样本数量\n占比\n\n\n\t\t\n训练集\n7,377\n85%\n\n\n验证集\n868\n10%\n\n\n测试集\n435\n5%\n\n\n总计\n8,680… See the full description on the dataset page: 
https://huggingface.co/datasets/GaryYang123/zh-meme-sft-8k.","downloads":308,"tags":["task_categories:text-generation","language:zh","license:mit","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","conversational","dialog","chinese","Meme","humor"],"createdAt":"2026-03-11T23:18:41.000Z","key":""},{"_id":"69b276dc3fadc91fa2837158","id":"PolarSeeker/OpenSeeker-v1-Data","author":"PolarSeeker","disabled":false,"gated":false,"lastModified":"2026-03-17T20:45:50.000Z","likes":48,"trendingScore":2,"private":false,"sha":"5970e581dc1548f37d846a1955cb8b651d2b657e","description":"\n\n  OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data\n\n\n\n\n\n\n\n\nOpenSeeker is an open-source search agent system that democratizes access to frontier search capabilities by fully open-sourcing its training data. We fine-tuned Qwen3-30B-A3B-Thinking-2507 with 11.7K training examples and achieved state-of-the-art performance on frontier search benchmarks:\n\n\n\t\n\t\t\n\t\n\t\n\t\tHighlights\n\t\n\n\nSuperior performance on search agent benchmarks: 48.4 on BrowseComp-ZH, 29.5 on… See the full description on the dataset page: https://huggingface.co/datasets/PolarSeeker/OpenSeeker-v1-Data.","downloads":1229,"tags":["task_categories:question-answering","language:en","license:mit","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2603.15594","region:us","agent"],"createdAt":"2026-03-12T08:18:36.000Z","key":""},{"_id":"69b50e502b0587383a0e526b","id":"stepfun-ai/Step-3.5-Flash-SFT","author":"stepfun-ai","disabled":false,"gated":false,"lastModified":"2026-03-14T14:22:37.000Z","likes":340,"trendingScore":2,"private":false,"sha":"c994154a801557540c56af623f31b58c4770c652","description":"\n\t\n\t\t\n\t\tStep-3.5-Flash-SFT\n\t\n\nStep-3.5-Flash-SFT is a general-domain supervised 
fine-tuning release for chat models.\nThis repository keeps the full training interface in one place:\n\njson/: canonical raw training data\ntokenizers/: tokenizer snapshots for Step-3.5-Flash and Qwen3, released to preserve chat-template alignment\ncompiled/: tokenizer-specific compiled shards for StepTronOSS training\n\n\n\t\n\t\t\n\t\n\t\n\t\tData Format\n\t\n\nEach raw shard is a JSON file whose top level is a list of examples. Each… See the full description on the dataset page: https://huggingface.co/datasets/stepfun-ai/Step-3.5-Flash-SFT.","downloads":4732,"tags":["task_categories:text-generation","language:multilingual","license:apache-2.0","license:cc-by-nc-2.0","size_categories:1M<n<10M","region:us","chat","sft","instruction-tuning","reasoning","code","agent"],"createdAt":"2026-03-14T07:29:20.000Z","key":""},{"_id":"69b958ad0934e5f6fdfdc2d6","id":"nllg/DaTikZ-V4","author":"nllg","disabled":false,"gated":false,"lastModified":"2026-03-17T14:04:01.000Z","likes":5,"trendingScore":2,"private":false,"sha":"33734c83608211682be11001a1618856fc1979dd","description":"\n\t\n\t\t\n\t\tDataset Card for DaTikZ-V4\n\t\n\nDaTikZ-V4 is the dataset used to train TikZilla-3B, TikZilla-3B-RL, TikZilla-8B, and TikZilla-8B-RL for generating TikZ/LaTeX figures from natural language descriptions.\nThe TikZ code has been sourced from ArXiv, GitHub, and TeXStackExchange. 
Scientific figure descriptions were generated using Qwen2.5-VL-7B-Instruct.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset fields\n\t\n\nEach sample contains:\n\nfile_id: unique identifier\ncaption: original caption\nvlm_description: detailed visual… See the full description on the dataset page: https://huggingface.co/datasets/nllg/DaTikZ-V4.","downloads":743,"tags":["task_categories:text-generation","license:apache-2.0","size_categories:100K<n<1M","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us","tikz","latex","code-generation","scientific-figures"],"createdAt":"2026-03-17T13:35:41.000Z","key":""},{"_id":"69bced7c4b535e550280d209","id":"wanglab/bioreason-pro-sft-reasoning-data","author":"wanglab","disabled":false,"gated":false,"lastModified":"2026-03-20T15:09:22.000Z","likes":5,"trendingScore":2,"private":false,"sha":"2cab301d674487c4d8536aff4353f2df3c5f8235","description":"\n🧬 BioReason-ProAdvancing Protein Function Prediction withMultimodal Biological Reasoning\n\n\n\n  \n  \n  \n  \n\n\n\t\n\t\t\n\t\tBioReason-Pro SFT Reasoning Data\n\t\n\nTraining dataset for supervised fine-tuning of BioReason-Pro. 
Contains proteins with synthetic reasoning traces generated by GPT-5, GO term annotations, InterPro domains, STRING protein-protein interactions, and protein metadata from UniProt.\n\n\t\n\t\t\n\t\tCitation\n\t\n\nIf you find this work useful, please cite our papers:\n@article… See the full description on the dataset page: https://huggingface.co/datasets/wanglab/bioreason-pro-sft-reasoning-data.","downloads":279,"tags":["language:en","license:apache-2.0","size_categories:100K<n<1M","format:parquet","format:optimized-parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2505.23579","region:us","protein","gene-ontology","function-prediction","biology","bioinformatics","reasoning"],"createdAt":"2026-03-20T06:47:24.000Z","key":""},{"_id":"69c45b9e5030946bd70055bf","id":"ianncity/KIMI-K2.5-1000000x","author":"ianncity","disabled":false,"gated":false,"lastModified":"2026-04-07T02:04:22.000Z","likes":264,"trendingScore":2,"private":false,"sha":"de244b70a988b37cecd56ab69052591b3f28e845","description":"\n \n \n\n\n\nKIMI-K2.5-1000000x\n\n\n1,000,000 reasoning traces distilled from KIMI-K2.5 on high reasoning, (Each subset has different questions)\n\n\n\nDistribution:\n\nCoding: 50% (Includes: Webdev, Python, C++, Java, JS, C, Ruby, Lua, Rust, and C#)\nScience: 20% (Physics, Chemistry, Biology) - 100k more completions in the PHD-Science subset\nMath: 15% (Algebra, Calculus, Probability) - 200k more completions in kimiMath200k.jsonl\nComputer Science: 5%\nLogical Questions: 5%\nCreative Writing: 5%… See the full description on the dataset page: 
https://huggingface.co/datasets/ianncity/KIMI-K2.5-1000000x.","downloads":4165,"tags":["task_categories:text-generation","task_categories:question-answering","language:en","license:apache-2.0","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","reasoning","chain-of-thought","instruction-tuning","sft"],"createdAt":"2026-03-25T22:03:10.000Z","key":""},{"_id":"69c4b59c77a5f0fbd204f49e","id":"Luoberta/cve_train_v1.1","author":"Luoberta","disabled":false,"gated":false,"lastModified":"2026-03-27T07:11:43.000Z","likes":4,"trendingScore":2,"private":false,"sha":"315e12b6a44325cba1ae1efc9161227ce0dc25a5","description":"\n\t\n\t\t\n\t\tCVE-Factory Agent Traces v1.1\n\t\n\nThis dataset is an expanded version of cve_train, containing 18,783 distilled agent traces for CVE reproduction tasks. The traces were generated using Claude Opus 4.5 with a Mini SWE-Agent harness through the CVE-Factory pipeline.\n\n\t\n\t\t\n\t\n\t\n\t\tWhat's New in v1.1\n\t\n\nCompared to cve_train (v1.0):\n\n18.8k total samples (up from ~4k in v1.0)\n+3k agentic tasks from cve_tasks_3k_compressed\nAdditional traces from expanded CVE task coverage\n\t\n\t\t\n\t\tTraining… See the full description on the dataset page: 
https://huggingface.co/datasets/Luoberta/cve_train_v1.1.","downloads":88,"tags":["task_categories:text-generation","language:en","license:mit","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2602.03012","region:us","security","cve","vulnerability","agent-traces","sft","code"],"createdAt":"2026-03-26T04:27:08.000Z","key":""},{"_id":"69cc2a7078408474d1660501","id":"KIT-MRT/KITScenes-LongTail","author":"KIT-MRT","disabled":false,"gated":"auto","lastModified":"2026-04-06T20:55:17.000Z","likes":17,"trendingScore":2,"private":false,"sha":"88d7d6c7a97f4923ca58dde3bac292b1597f53db","description":"\n\t\n\t\t\n\t\tKITScenes LongTail Dataset\n\t\n\nIn real-world domains such as self-driving, generalization to rare scenarios remains a fundamental challenge. \nTo address this, we introduce a new dataset designed for end-to-end driving that focuses on long-tail driving events. \nWe provide multi-view video data, trajectories, high-level instructions, and detailed reasoning traces, facilitating in-context learning and few-shot generalization.\nThe resulting benchmark for multimodal models, such as VLMs and… See the full description on the dataset page: https://huggingface.co/datasets/KIT-MRT/KITScenes-LongTail.","downloads":176,"tags":["language:en","language:es","language:zh","license:cc-by-nc-4.0","size_categories:n<1K","format:parquet","format:optimized-parquet","modality:image","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2603.23607","region:us"],"createdAt":"2026-03-31T20:11:28.000Z","key":""},{"_id":"69ce35fdfe258982a4e2b279","id":"paperbd/paper_instructions_300K-v1","author":"paperbd","disabled":false,"gated":false,"lastModified":"2026-06-06T12:18:58.000Z","likes":13,"trendingScore":2,"private":false,"sha":"fc9f52b7bac8be22957b5c31c7c5bcd98e4069cb","description":"Loading will work as follows:\n\n\t\n\t\t\n\t\n\t\n\t\tExisting 
behavior\n\t\n\n# Loads the SFT dataset containing instruction, prompt, output\nload_dataset(\"paperbd/paper_instructions_300K-v1\") \n\n\n\t\n\t\t\n\t\n\t\n\t\tReasoning variant\n\t\n\n# Loads reasoning subset containing instruction, prompt, reasoning, output\nload_dataset(\n    \"paperbd/paper_instructions_300K-v1\",\n    \"reasoning\",\n    split=\"train\",\n)\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Summary\n\t\n\nThis dataset contains synthetic supervised fine-tuning data generated from academic… See the full description on the dataset page: https://huggingface.co/datasets/paperbd/paper_instructions_300K-v1.","downloads":601,"tags":["task_categories:question-answering","task_categories:summarization","task_categories:text-generation","language:en","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-04-02T09:25:17.000Z","key":""},{"_id":"69d185a53c023c2c9072697a","id":"netflix/Vera-Layered-Video-Dataset","author":"netflix","disabled":false,"gated":false,"lastModified":"2026-06-29T02:13:49.000Z","likes":2,"trendingScore":2,"private":false,"sha":"d56723462a1cfbeeb8854ef96a634eccdbb70b80","description":"\n\t\n\t\t\n\t\n\t\n\t\tDataset for Vera: A Layered Diffusion Model for Content-Preserving Video Editing\n\t\n\n\n  Hongkai Zheng¹²* &nbsp;·&nbsp;\n  Ta-Ying Cheng² &nbsp;·&nbsp;\n  Benjamin Klein² &nbsp;·&nbsp;\n  Yisong Yue² &nbsp;·&nbsp;\n  Zhuoning Yuan²†\n\n\n\n  ¹California Institute of Technology &nbsp;&nbsp; ²Netflix, Inc.\n  *Work done during an internship at Netflix &nbsp; †Project Lead\n\n\n\n  \n  \n  \n  \n\n\n\nTL;DR: A layered diffusion framework for video editing. 
Vera jointly generates an edit layer, an alpha… See the full description on the dataset page: https://huggingface.co/datasets/netflix/Vera-Layered-Video-Dataset.","downloads":10124,"tags":["task_categories:text-to-video","license:apache-2.0","size_categories:10K<n<100K","modality:video","arxiv:2606.23610","region:us","diffusion","layered-diffusion","video","layered-video-dataset","video-editing","video-generation"],"createdAt":"2026-04-04T21:41:57.000Z","key":""},{"_id":"69d638c6dd0ebed360d6ad0f","id":"ScienceOne-AI/S1-DeepResearch-15k","author":"ScienceOne-AI","disabled":false,"gated":false,"lastModified":"2026-04-14T18:14:00.000Z","likes":9,"trendingScore":2,"private":false,"sha":"1e9685c7c7277514a7eac1ba286a5b941cfca312","description":"\n\t\n\t\t\n\t\tS1-DeepResearch-15k Dataset\n\t\n\n\n\t\n\t\t\n\t\tOverview\n\t\n\nThe S1-DeepResearch dataset is a curated collection of approximately 15k samples designed to improve deep research capabilities of large language models.\nThe dataset includes two types of tasks:  \n\nVerifiable tasks (labeled as \"Closed-ended Multi-hop Resolution\")  \nOpen-ended tasks (labeled as \"Open-ended Exploration\")\n\n\n\t\n\t\t\n\t\tDataset Composition\n\t\n\nThe dataset is organized into five core capability dimensions:\n\nLong-chain complex… See the full description on the dataset page: https://huggingface.co/datasets/ScienceOne-AI/S1-DeepResearch-15k.","downloads":14621,"tags":["language:en","language:zh","license:apache-2.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","agent"],"createdAt":"2026-04-08T11:15:18.000Z","key":""},{"_id":"69d6b9ec54a04b1f8d2f2b82","id":"jedisct1/security-audits","author":"jedisct1","disabled":false,"gated":false,"lastModified":"2026-06-08T11:59:16.000Z","likes":9,"trendingScore":2,"private":false,"sha":"6d527ff0081eec6704c2a4f00e1ef8d308ae7366","description":"A collection of agent traces generated with 
Swival (not Claude Code, despite what the HF interface currently shows), an agent designed for open-source models.\nThese traces focus on security audits of opensource software.\n\n\t\n\t\t\n\t\n\t\n\t\tSharing traces with Swival\n\t\n\nSwival can export full conversation traces with --trace-dir, which writes one <session_id>.jsonl file per session:\nswival \"Fix the login bug\" --trace-dir traces/\n\nThose JSONL files use Swival's Claude Code compatible trace export, and… See the full description on the dataset page: https://huggingface.co/datasets/jedisct1/security-audits.","downloads":13872,"tags":["task_categories:text-generation","language:en","language:code","license:mit","size_categories:10K<n<100K","format:json","format:agent-traces","modality:tabular","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us","agent-traces","coding-agent","swival"],"createdAt":"2026-04-08T20:26:20.000Z","key":""},{"_id":"69dfbd075e799dad03f5d852","id":"yixuan-tan/EgoDex-LeRobot-v3.0","author":"yixuan-tan","disabled":false,"gated":false,"lastModified":"2026-04-28T06:32:35.000Z","likes":5,"trendingScore":2,"private":false,"sha":"b2aff15977052ac687417cda246bf0d3a76fc369","downloads":4873,"tags":["region:us"],"createdAt":"2026-04-15T16:29:59.000Z","key":""},{"_id":"69e17e64564f2aa1bcd92d2f","id":"nvidia/SWE-Zero-openhands-trajectories","author":"nvidia","disabled":false,"gated":false,"lastModified":"2026-05-05T19:34:05.000Z","likes":13,"trendingScore":2,"private":false,"sha":"7b3cd106d00f60918e722d33a1d74bc67072a7ea","description":"\n\t\n\t\t\n\t\tSWE-Zero Trajectories: Execution-free Fine-tuning for Software Engineering Agents\n\t\n\n\n\t\n\t\t\n\t\tData Overview\n\t\n\nSWE-ZERO Trajectories is an agentic instruction tuning dataset designed to advance the capabilities of LLMs in software engineering. This dataset comprises 318k agent \ntrajectories collected using the OpenHands framework. 
The trajectories \nwere synthesized using Qwen3-Coder-480B-A35B-Instruct, specifically curated for supervised fine-tuning (SFT), \naiming to improve model… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/SWE-Zero-openhands-trajectories.","downloads":2253,"tags":["license:cc-by-4.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2604.01496","region:us","code","synthetic","tools","agents","software"],"createdAt":"2026-04-17T00:27:16.000Z","key":""},{"_id":"69e397579c4ba6fb95a726bc","id":"vietmed/ChemPatentTableQA","author":"vietmed","disabled":false,"gated":"manual","lastModified":"2026-04-29T14:04:57.000Z","likes":2,"trendingScore":2,"private":false,"sha":"de5e09d9f4efe228750c23baaddef77e5c899cae","description":"\n\t\n\t\t\n\t\tChemPatentTableQA\n\t\n\nA visual question-answering dataset built over table images extracted from chemical patents (EPO + USPTO). Each row bundles one table image, the surrounding patent context, table metadata, and a single- or multi-turn question/answer conversation with step-by-step reasoning traces.\n\n\t\n\t\t\n\t\tQuick start\n\t\n\nfrom datasets import load_dataset\n\nds = load_dataset(\"vietmed/ChemPatentTableQA\", split=\"train\")\nsample = ds[0]\nsample[\"image\"].show()           # PIL.Image —… See the full description on the dataset page: 
https://huggingface.co/datasets/vietmed/ChemPatentTableQA.","downloads":13,"tags":["task_categories:visual-question-answering","task_categories:question-answering","language:en","license:cc-by-4.0","size_categories:100K<n<1M","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us","chemistry","patents","tables","vision-language","multi-turn"],"createdAt":"2026-04-18T14:38:15.000Z","key":""},{"_id":"69e65bab79ed4308cde96e34","id":"Pritesh-2711/pii-bench","author":"Pritesh-2711","disabled":false,"gated":false,"lastModified":"2026-06-03T09:51:32.000Z","likes":6,"trendingScore":2,"private":false,"sha":"713f76a486e83c498ab173076abc9cca63918ae3","description":"\n\t\n\t\t\n\t\n\t\n\t\tPIIBench\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tDescription\n\t\n\nPIIBench is a unified benchmark dataset for PII detection across multiple domains.\n\n\t\n\t\t\n\t\n\t\n\t\tPaper\n\t\n\n\narXiv: http://arxiv.org/abs/2604.15776\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Summary\n\t\n\n\nTotal records: 999,940\nEntity types: 82\nBIO labels: 165 including O\nFormat: BIO token classification with source text\n\n\n\t\n\t\t\n\t\n\t\n\t\tStructure\n\t\n\nEach example contains:\n\ntokens: list of tokens\nlabels: BIO labels\nsource: original data source of the sample\ntext:… See the full description on the dataset page: 
https://huggingface.co/datasets/Pritesh-2711/pii-bench.","downloads":218,"tags":["task_categories:token-classification","language:en","license:apache-2.0","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2604.15776","region:us","pii","ner","privacy","benchmark"],"createdAt":"2026-04-20T17:00:27.000Z","key":""},{"_id":"69e7c0d9991ee50fa7a73aef","id":"yauheniya-adesso/icongenai-svg-captions","author":"yauheniya-adesso","disabled":false,"gated":false,"lastModified":"2026-04-23T16:26:47.000Z","likes":2,"trendingScore":2,"private":false,"sha":"fe2794c68072540a58ae28788cc55dfa42b6b520","description":"\n\t\n\t\t\n\t\tIconGenAI SVG Captions\n\t\n\nCaptioned SVG icons from the Iconify corpus, intended for fine-tuning text-to-SVG generation models.\nPart of the IconGenAI research project.\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tFiles\n\t\n\nTwo files are provided at different stages of the processing pipeline:\n\n\t\n\t\t\nFile\nRecords\nPurpose\n\n\n\t\t\nicons_captioned_merged.jsonl\n275,912\nFull license-filtered corpus with VLM-generated captions and collection metadata\n\n\nicons_training_captioned.jsonl227,821\nQuality-filtered, normalised subset… See the full description on the dataset page: 
https://huggingface.co/datasets/yauheniya-adesso/icongenai-svg-captions.","downloads":241,"tags":["task_categories:text-to-image","language:en","license:cc-by-nc-4.0","size_categories:100K<n<1M","format:json","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","svg","icons","captions","fine-tuning"],"createdAt":"2026-04-21T18:24:25.000Z","key":""},{"_id":"69e93175683c0c027c1b4f01","id":"project-oceania/planktonzilla-17M","author":"project-oceania","disabled":false,"gated":false,"lastModified":"2026-06-27T05:10:09.000Z","likes":5,"trendingScore":2,"private":false,"sha":"33822bb02100110194503b8793865ccdaa24e315","description":"\n\t\n\t\t\n\t\n\t\n\t\tPlanktonzilla-17M Dataset\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tOverview\n\t\n\nplanktonzilla-17M is a large-scale, comprehensive dataset combining 17 million plankton images from all publicly available -to the best\nof our knowledge- labeled plankton datasets. This unified collection enables researchers to train robust deep learning models for plankton \nidentification and classification across diverse imaging systems and oceanographic environments.\nEach image includes a standardized taxonomic hierarchy… See the full description on the dataset page: https://huggingface.co/datasets/project-oceania/planktonzilla-17M.","downloads":2351,"tags":["task_categories:image-to-text","task_categories:image-classification","task_categories:image-text-to-text","task_categories:image-text-to-image","language:en","license:cc-by-4.0","size_categories:10M<n<100M","format:parquet","format:optimized-parquet","modality:image","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2606.00080","region:us","plankton","ocean","plankton 
classification","climate"],"createdAt":"2026-04-22T20:37:09.000Z","key":""},{"_id":"69e9e7af2624860c69f362c6","id":"MERChallenge/MER2026","author":"MERChallenge","disabled":false,"gated":"manual","lastModified":"2026-05-12T07:53:24.000Z","likes":13,"trendingScore":2,"private":false,"sha":"82c4ff0562ca8c4ef4ba0800ee808fb75414cffd","description":"\n\t\n\t\t\n\t\tDataset Access Form\n\t\n\nPlease follow this format before submitting the gated form. Many requests are rejected because the team information does not match the expected format.\n\n\t\n\t\t\n\t\tExample Application\n\t\n\n\n\t\n\t\t\nField\nExample\n\n\n\t\t\nTeam Name\nTongji-Affect-Lab\n\n\nTeam Leader Name\nAlice Chen\n\n\nTeam Leader Email\nalice.chen@university.edu\n\n\nTeam Members (comma-separated)\nAlice Chen, Bob Li, Carol Wang\n\n\nOrganization / University / Company\nTongji University\n\n\nCountry / Region\nChina… See the full description on the dataset page: https://huggingface.co/datasets/MERChallenge/MER2026.","downloads":4302,"tags":["language:en","license:cc-by-nc-4.0","arxiv:2604.19417","region:us"],"createdAt":"2026-04-23T09:34:39.000Z","key":""},{"_id":"69ed241581024aba54e76859","id":"demalenk/caltennis","author":"demalenk","disabled":false,"gated":false,"lastModified":"2026-05-06T10:55:32.000Z","likes":4,"trendingScore":2,"private":false,"sha":"6c1c10b2e6c16d46ab1a83614f439d2fb91e9d7d","description":"\n\t\n\t\t\n\t\tCalTennis: Large Multi-View Tennis Video Dataset\n\t\n\n\nCalTennis is a large-scale video benchmark designed for evaluating monocular-to-3D human pose estimation in the wild.\nThe dataset comprises over 11 million frames (51 hours) of tennis practice and match play from 40 players, captured with 2–6 synchronized cameras at 60Hz. 
It is 10x larger than existing in-the-wild human motion video datasets and offers the first large-scale benchmark for synchronized multi-view recordings of expert… See the full description on the dataset page: https://huggingface.co/datasets/demalenk/caltennis.","downloads":1118,"tags":["task_categories:keypoint-detection","license:cc-by-nc-4.0","size_categories:n<1K","format:json","modality:text","modality:video","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","human-pose-estimation","3d-pose","multi-view","sports","smpl-x"],"createdAt":"2026-04-25T20:29:09.000Z","key":""},{"_id":"69f082c7c909a14a35ebbfe9","id":"disco-eth/WorldSpeech","author":"disco-eth","disabled":false,"gated":false,"lastModified":"2026-05-18T22:31:13.000Z","likes":20,"trendingScore":2,"private":false,"sha":"7fc2c2f19528b3d3972110a04e500098f6fc7f24","description":"\n\t\n\t\t\n\t\tWorldSpeech\n\t\n\nA multilingual ASR dataset containing over 65k hours of human transcribed speech across 127 language-region variants, drawn from national parliaments, public broadcasters, public-domain audiobooks, and international institutions. 
Rows consist of 24 kHz speech utterances paired with a human-provided transcript, an aligned ASR transcript, character error rate (CER) between the two, a WADA-SNR estimate, and four DNSMOS-P.835 quality scores.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Overview… See the full description on the dataset page: https://huggingface.co/datasets/disco-eth/WorldSpeech.","downloads":26604,"tags":["task_categories:automatic-speech-recognition","task_categories:text-to-speech","task_categories:audio-classification","language:af","language:am","language:ar","language:az","language:be","language:bn","language:ca","language:ckb","language:cnr","language:crs","language:cs","language:de","language:dv","language:el","language:en","language:eo","language:es","language:fa","language:fr","language:ga","language:grc","language:ha","language:he","language:hi","language:hu","language:hy","language:id","language:ig","language:iu","language:ja","language:ka","language:kk","language:km","language:ko","language:la","language:lb","language:lo","language:mfe","language:mi","language:ml","language:mn","language:mr","language:ms","language:my","language:ne","language:nl","language:nr","language:nso","language:om","language:pa","language:pl","language:pt","language:rm","language:ro","language:ru","language:rw","language:si","language:sm","language:sn","language:sq","language:ss","language:st","language:sv","language:sw","language:ta","language:th","language:ti","language:tl","language:tn","language:tr","language:ts","language:ug","language:uz","language:ve","language:vi","language:xh","language:yue","language:zh","language:zu","license:cc-by-nc-4.0","size_categories:10M<n<100M","modality:audio","arxiv:2605.09167","doi:10.57967/hf/8660","region:us","speech","multilingual","low-resource","parliamentary","asr","tts","audio"],"createdAt":"2026-04-28T09:49:59.000Z","key":""},{"_id":"69f3abbade4c23a0f1ec5fc9","id":"microsoft/GridSFM_US_power_grid","author":"microsoft","disabled":false,"gated":false,"lastModifie
d":"2026-05-25T13:21:00.000Z","likes":6,"trendingScore":2,"private":false,"sha":"2d770f48cb79cd8e5eae662a780edcb3ecec6c14","description":"\n\t\n\t\t\n\t\tGridSFM US Power Grid Dataset\n\t\n\n\n\t\n\t\t\n\t\tOverview\n\t\n\nGridSFM US Power Grid Dataset is a set of geographically grounded, electrically coherent power-system network derived entirely from publicly available data. It was developed to support AC optimal power flow (AC-OPF) analysis, enabling physics-based study of congestion, capacity, and demand sitting without restricted data.\nA detailed discussion of GridSFM US Power Grid Dataset, including how it was developed and evaluated, can be… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/GridSFM_US_power_grid.","downloads":533,"tags":["task_categories:other","license:mit","arxiv:2605.04289","region:us","power-systems","optimal-power-flow","energy","power-grid-modeling","openstreetmap","us-eia","us-census"],"createdAt":"2026-04-30T19:21:30.000Z","key":""},{"_id":"69f638c8ebff1de2d6753093","id":"GokuScraper/seedance-2-prompts-datasets","author":"GokuScraper","disabled":false,"gated":false,"lastModified":"2026-06-30T03:13:53.000Z","likes":7,"trendingScore":2,"private":false,"sha":"2b358646ef8f4c0ee8472c3262bad8441a0689a5","description":"\n\t\n\t\t\n\t\n\t\n\t\t🎞️ Seedance-2-prompts-datasets\n\t\n\n   \n\n🎞️ The ultimate Seedance-2 video prompt dataset (12GB+). 2000+ video generation prompts with full metadata and preview frames. Truly open source: No login, no ads, no redirection. Just pure data for AI video creators.\n\nThis project is a massive collection of prompts used for Bytedance's Seedance 2.0 and the resulting generated videos. 
The entire dataset exceeds 12GB and contains 2000+ videos, all structured into a comprehensive dataset.\nDue… See the full description on the dataset page: https://huggingface.co/datasets/GokuScraper/seedance-2-prompts-datasets.","downloads":100298,"tags":["task_categories:text-to-video","language:en","language:zh","license:cc-by-4.0","size_categories:1K<n<10K","format:json","modality:image","modality:text","modality:video","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","video-prompt","seedance-2","prompt-engineering","prompt-dataset","video-generation"],"createdAt":"2026-05-02T17:47:52.000Z","key":""},{"_id":"69f86fba90085d3085713617","id":"EuniAI/TerminalWorld","author":"EuniAI","disabled":false,"gated":false,"lastModified":"2026-06-15T16:57:51.000Z","likes":6,"trendingScore":2,"private":false,"sha":"dda7c099cc076735aef28c03bf8d3624dc0564e1","description":"\n\t\n\t\t\n\t\n\t\n\t\tTerminalWorld\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Summary\n\t\n\nTerminalWorld is a benchmark dataset for evaluating AI agents on real-world terminal and command-line tasks. 
It contains 1,530 terminal-based tasks reverse-engineered from publicly available terminal recordings, covering domains such as data processing, system administration, networking, security, version control, containers and orchestration, debugging and testing, environment setup, and scientific computing.\nEach task includes a… See the full description on the dataset page: https://huggingface.co/datasets/EuniAI/TerminalWorld.","downloads":5888,"tags":["task_categories:text-generation","task_categories:reinforcement-learning","task_categories:question-answering","language:en","license:cc-by-4.0","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2605.22535","region:us","terminal-agents","coding-agents","cli","benchmark","evaluation","agent-evaluation","arxiv:2605.22535"],"createdAt":"2026-05-04T10:06:50.000Z","key":""},{"_id":"69f963c302a11424bc61c766","id":"nvidia/PhysicalAI-WorldModel-Synthetic-Autonomous-Driving-Scenarios","author":"nvidia","disabled":false,"gated":false,"lastModified":"2026-06-09T18:27:54.000Z","likes":17,"trendingScore":2,"private":false,"sha":"c6fceba249770c5f8c61c7a6addab847003dca4f","description":"\n\t\n\t\t\n\t\n\t\n\t\tDataset Description: \n\t\n\nPhysicalAI-WorldModel-Synthetic-Autonomous-Driving-Scenarios is a large-scale synthetic video dataset of autonomous-driving scenes generated with NVIDIA's internal Omniverse simulation platform. Each clip is a temporally consistent multi-camera surround capture of one ego vehicle and surrounding traffic participants, paired with per-camera VLM captions. 
The dataset is designed to fill gaps in real-world driving data along two axes: (1) targeted long-tail… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/PhysicalAI-WorldModel-Synthetic-Autonomous-Driving-Scenarios.","downloads":170935,"tags":["language:en","license:other","size_categories:100K<n<1M","modality:video","region:us","video","driving","autonomous","vehicle","police","pedestrian","emergency","lanechange","physical-ai"],"createdAt":"2026-05-05T03:28:03.000Z","key":""},{"_id":"69fa94c2dae9b5bbe9c86e84","id":"Inferact/codex_swebenchpro_traces","author":"Inferact","disabled":false,"gated":false,"lastModified":"2026-05-07T01:13:21.000Z","likes":25,"trendingScore":2,"private":false,"sha":"0d52ae8c75738117be9e58c7071bd9a5b43ff78f","description":"This is a dataset generated by real swebenchpro agentic workload trace + codex agent.\n\n\n\t\n\t\t\n\t\t1. Eval Result Summary\n\t\n\n\n\t\n\t\t\nMetric\nValue\n\n\n\t\t\nTotal trials\n731\n\n\nSuccessful trials\n610\n\n\nFailed trials\n120\n\n\nNo data (skipped)\n1\n\n\nPassed\n329\n\n\nPass rate (of successful)\n53.9%\n\n\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tPer-Repo Breakdown\n\t\n\n\n\t\n\t\t\nRepo\nTotal\nSuccess\nFailed\nPassed\nPass%\n\n\n\t\t\nansible/ansible\n96\n93\n3\n60\n65%\n\n\ninternetarchive/openli\n91\n88\n3\n52\n59%\n\n\nflipt-io/flipt85\n82\n3\n26\n32%\n\n\nqutebrowser/qutebrowse\n79\n78\n1… See the full description on the dataset page: 
https://huggingface.co/datasets/Inferact/codex_swebenchpro_traces.","downloads":716,"tags":["license:mit","size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-05-06T01:09:22.000Z","key":""},{"_id":"69fc9ace5e0d5b0eb7a969e5","id":"Exgentic/agent-llm-traces","author":"Exgentic","disabled":false,"gated":false,"lastModified":"2026-06-07T07:55:41.000Z","likes":20,"trendingScore":2,"private":false,"sha":"70036b93a04e61b0ea2706a68b962f4f26774587","description":"\n\t\n\t\t\n\t\n\t\n\t\tMulti-Benchmark LLM Agent Traces\n\t\n\nA comprehensive dataset of OpenTelemetry traces capturing LLM inference behavior across multiple agent frameworks, benchmarks, and model providers. This dataset enables research into LLM performance analysis, agent behavior patterns, and inference optimization.\nCollected by Exgentic - A platform for LLM observability and performance optimization.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Overview\n\t\n\nThis dataset contains 1,781 execution traces capturing detailed agent… See the full description on the dataset page: https://huggingface.co/datasets/Exgentic/agent-llm-traces.","downloads":4992,"tags":["task_categories:text-generation","language:en","license:cdla-permissive-2.0","size_categories:1K<n<10K","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us","llm","traces","opentelemetry","benchmarks","agents"],"createdAt":"2026-05-07T13:59:42.000Z","key":""},{"_id":"69fcb280a4baf2cf3575978a","id":"Rice-RobotPI-Lab/egoinfinity","author":"Rice-RobotPI-Lab","disabled":false,"gated":false,"lastModified":"2026-06-18T18:11:44.000Z","likes":6,"trendingScore":2,"private":false,"sha":"1e3fe515f9a09b0161ead4e2bf0ccc5a1b9f708b","description":"\n\t\n\t\t\n\t\n\t\n\t\tEgoInfinity (preview)\n\t\n\nDerivative scene assets for a curated subset of Action100M (Meta FAIR) clips.\nPreview — not 
for general release. Schema and contents may change.\n\n\t\n\t\t\n\t\n\t\n\t\tLicense\n\t\n\nFAIR Noncommercial Research License v1 (see LICENSE-Action100M). Noncommercial research only.\nBuilt by the Rice RobotPI Lab.\n\n\t\n\t\t\n\t\n\t\n\t\tRetarget\n\t\n\nFor 104 of the clips under samples/, we provide retargeting results on four\nrobot embodiments under samples/<clip>/retarget/<robot>/:\n\nfranka —… See the full description on the dataset page: https://huggingface.co/datasets/Rice-RobotPI-Lab/egoinfinity.","downloads":4630,"tags":["task_categories:robotics","license:other","size_categories:1K<n<10K","modality:3d","modality:video","library:datasets","library:mlcroissant","region:us","egocentric","imitation-learning","retargeting","manipulation","action100m"],"createdAt":"2026-05-07T15:40:48.000Z","key":""},{"_id":"69fd34bb91dd9dbd82db096d","id":"Yuanhao-Harry-Wang/fitvto-100k","author":"Yuanhao-Harry-Wang","disabled":false,"gated":false,"lastModified":"2026-06-14T10:51:28.000Z","likes":5,"trendingScore":2,"private":false,"sha":"5563646729edf148ed2b32c4b9c794d51a1bc828","description":"\n\t\n\t\t\n\t\n\t\n\t\tFIT: A Large-Scale Dataset for Fit-Aware Virtual Try-On\n\t\n\nThe official preview dataset from the paper \"FIT: A Large-Scale Dataset for Fit-Aware Virtual Try-On\".\nThis dataset supports garment-centric virtual try-on and try-off research, containing 100,000 training and 5,000 evaluation triplets. 
Each sample pairs a person image with a layflat garment image and body/garment measurements.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Structure\n\t\n\nEach split (train / eval) contains four aligned modalities — all… See the full description on the dataset page: https://huggingface.co/datasets/Yuanhao-Harry-Wang/fitvto-100k.","downloads":1161,"tags":["task_categories:image-to-image","language:en","license:cc-by-nc-nd-4.0","size_categories:100K<n<1M","format:parquet","format:optimized-parquet","modality:image","library:datasets","library:dask","library:polars","library:mlcroissant","region:us","virtual-try-on","fashion","garment","try-off","clothing"],"createdAt":"2026-05-08T00:56:27.000Z","key":""},{"_id":"6a041ab186ebfeb767465f0b","id":"zlab-princeton/i1-captions","author":"zlab-princeton","disabled":false,"gated":false,"lastModified":"2026-06-12T02:01:04.000Z","likes":19,"trendingScore":2,"private":false,"sha":"bb8c4a4da111c1e0b2a0afa53d381ec57b98ad19","description":"i1: A Simple and Fully Open Recipe for Strong Text-to-Image Models\nBoya Zeng, Tianze Luo, Shu Pu, Jucheng Shen, Taiming Lu, Gabriel Sarch, Zhuang Liu\nPrinceton University\n[arXiv][code][model][project page]\n\n    \n\n\n\n\t\n\t\t\n\t\n\t\n\t\t1. Overview\n\t\n\nThis dataset contains all captions used in our controlled experiments and the final training of the i1 model. 
Detailed instructions for downloading the corresponding images and matching the image-caption pairs can be found in our codebase.\n\n\t\n\t\t\n\t\n\t\n\t\t2.… See the full description on the dataset page: https://huggingface.co/datasets/zlab-princeton/i1-captions.","downloads":5129,"tags":["task_categories:text-to-image","size_categories:100M<n<1B","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2606.11289","region:us"],"createdAt":"2026-05-13T06:31:13.000Z","key":""},{"_id":"6a07052645a1d8ca70ef7e0c","id":"Rolv-Arild/hlrlrd-parsed","author":"Rolv-Arild","disabled":false,"gated":false,"lastModified":"2026-05-16T10:06:28.000Z","likes":3,"trendingScore":2,"private":false,"sha":"318fdb72c70f30c42a4a7b6bf8e9d2972051d0f6","description":"The High-Level Rocket League Replay Dataset, parsed into rlgym-tools ReplayFrame objects, and serialized into numpy arrays.\nTo stream or download this dataset, you can use this code as reference. 
Note that finding all the remote files can take some time.\nimport os\nimport glob\nimport numpy as np\nfrom concurrent.futures import ThreadPoolExecutor\nfrom huggingface_hub import DatasetCard, HfFileSystem, snapshot_download\nfrom rlgym_tools.rocket_league.misc.serialize import deserialize_replay_frame… See the full description on the dataset page: https://huggingface.co/datasets/Rolv-Arild/hlrlrd-parsed.","downloads":2282,"tags":["region:us"],"createdAt":"2026-05-15T11:36:06.000Z","key":""},{"_id":"6a085fa4944148ccb2d85ee5","id":"LukaDev13/Liminal-Dreamcore-1K","author":"LukaDev13","disabled":false,"gated":false,"lastModified":"2026-05-18T22:03:11.000Z","likes":34,"trendingScore":2,"private":false,"sha":"29d8ecc0e0ac76dc5098851face10eb6848d85ca","description":"\n\n\n\t\n\t\t\n\t\tDreamcore\n\t\n\n\n\t\n\t\t\n\t\tA Collection of 1000 AI-Generated Dreamcore Aesthetic Images\n\t\n\n\n\n\n\n\nAll images in this collection are AI-generated.\n\n\n\n\n\t\n\t\t\n\t\tArchitecture\n\t\n\nThe generation pipeline behind this collection:\n\n\n\n\n\n\n\n\t\n\t\t\n\t\tWhat Is Dreamcore?\n\t\n\nDreamcore is an internet aesthetic that captures the visual language of dreams - specifically the strange, liminal, half-remembered quality of dream imagery. 
It sits in the same family as weirdcore, traumacore, and oddcore, but has its own… See the full description on the dataset page: https://huggingface.co/datasets/LukaDev13/Liminal-Dreamcore-1K.","downloads":595,"tags":["license:mit","modality:image","region:us","ai-generated","dreamcore","aesthetic","image-collection","gpt-image"],"createdAt":"2026-05-16T12:14:28.000Z","key":""},{"_id":"6a097e82f1db8669a4da66b0","id":"STLAND/ZJU_OLAT","author":"STLAND","disabled":false,"gated":"manual","lastModified":"2026-06-29T08:47:07.000Z","likes":2,"trendingScore":2,"private":false,"sha":"c619732e24196a709363c33830e892693b179059","description":"\n\t\n\t\t\n\t\n\t\n\t\tThis is the repo for ZJU_OLAT_DATASET\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tOverview\n\t\n\nWe have two subset as introduced in our paper: dynamic-subset(Jcam) and static-subset(BMD), which named with the captured cameras.\n\n\t\n\t\t\n\t\n\t\n\t\tStatic-subet(BMD)\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tThere is some special issue we need to clarify of this subet:\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tlicense: apache-2.0\n\t\n\n","downloads":11,"tags":["region:us"],"createdAt":"2026-05-17T08:38:26.000Z","key":""},{"_id":"6a0aed2a95ec508059df47c8","id":"etri-lirs/KoTSQA-v.2.0","author":"etri-lirs","disabled":false,"gated":false,"lastModified":"2026-06-23T02:29:51.000Z","likes":3,"trendingScore":2,"private":false,"sha":"ff9349df469a765b4561959e36ef1b3f377765cd","description":"\n\t\n\t\t\n\t\n\t\n\t\tDataset Card for Dataset Name\n\t\n\n한국어 시간민감형 질의응답 데이터 (Korean Time-Sensitive Q&A Dataset: KoTSQA)\n\nCopyright (C) 2025~ ETRI LIRS.\nkoTSQA 데이터셋의 Split Version. 
train/validation 등의 용도로 사용하게 하고, test 6,750개, train 750 (강화학습에 활용)임\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Details\n\t\n\n\nTrain Part\n\nStats - Unchanged: 243(0.3240), changed: 264(0.3520), new: 243(0.3240),\nStats w/o false-premise - Unchanged: 181(0.3285), changed: 194(0.3521), new: 176(0.3194),\nStats of false-premise - Unchanged: 62(0.3116)… See the full description on the dataset page: https://huggingface.co/datasets/etri-lirs/KoTSQA-v.2.0.","downloads":106,"tags":["language:ko","license:cc-by-sa-4.0","size_categories:1K<n<10K","format:parquet","modality:document","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","질의응답,","시간민감형질의응답,","검색증강생성,","RAG,","QA,"],"createdAt":"2026-05-18T10:42:50.000Z","key":""},{"_id":"6a0bde24f8d23d4248aa0a23","id":"Jackrong/Claude-opus-4.7-TraceInversion-5000x","author":"Jackrong","disabled":false,"gated":false,"lastModified":"2026-05-19T10:20:17.000Z","likes":70,"trendingScore":2,"private":false,"sha":"ab3b48f1d461ec40af924fd3163d2b9c8eaeb07c","description":"\n  \n    \n      🌀 Claude-opus-4.7-TraceInversion-5000x\n      v1.0 Release\n    \n    A High-Fidelity Reconstructed CoT Dataset Saturated with the 'Opus Deep Logic Style' via Trace Inversion\n    📊 5,000 Samples\n    🧬 Trace Inversion & Negentropy\n    🛠 SFT & DPO Ready\n    🔥 Claude 4.7-Max Distillation\n    🌐 English & Multilingual\n  \n  \n    \n      💡 What is Trace Inversion?\n      In Large Language Model (LLM) reasoning distillation, proprietary API models (such as GPT-4/5 and Claude)… See the full description on the dataset page: 
https://huggingface.co/datasets/Jackrong/Claude-opus-4.7-TraceInversion-5000x.","downloads":2294,"tags":["task_categories:text-generation","annotations_creators:machine-generated","language:en","language:zh","language:ko","language:ru","language:ja","language:es","license:apache-2.0","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2603.07267","region:us","reasoning","trace-inversion","synthetic-data","chain-of-thought","distillation","claude-opus","negentropy","qwen","unsloth"],"createdAt":"2026-05-19T03:51:00.000Z","key":""},{"_id":"6a0d0c6c9749fe274b235f51","id":"WithinUsAI/gemini_3.5_flash_distilled_25k","author":"WithinUsAI","disabled":false,"gated":false,"lastModified":"2026-05-20T01:24:55.000Z","likes":5,"trendingScore":2,"private":false,"sha":"9f0ee56fc4158c2bfa8f5e72c9d576a6cae563c4","description":"\n\t\n\t\t\n\t\n\t\n\t\tGemini 3.5 Flash Distilled Dataset (25k)\n\t\n\nA 25,000-sample synthetic distilled dataset designed to replicate the core capabilities of Gemini 3.5 Flash: frontier-level agentic execution, rapid multi-step reasoning, dense context analysis, and advanced autonomous coding — all optimized for low-latency inference.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Summary\n\t\n\nThis dataset was created via template-based evolutionary synthesis with content-normalized SHA-256 deduplication. 
Every sample features… See the full description on the dataset page: https://huggingface.co/datasets/WithinUsAI/gemini_3.5_flash_distilled_25k.","downloads":271,"tags":["language:en","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","gemini","distillation","agentic","code-generation","reasoning","multimodal","instruction-following","synthetic","jsonl","zero-duplicates"],"createdAt":"2026-05-20T01:20:44.000Z","key":""},{"_id":"6a0d90ebddfbf2f7cbe897f3","id":"Roman1111111/gpt5.5-terminal","author":"Roman1111111","disabled":false,"gated":false,"lastModified":"2026-05-20T10:47:10.000Z","likes":4,"trendingScore":2,"private":false,"sha":"00283c9ffdefc5b0bfdca66cd10fdbdcb7fc2fe1","downloads":112,"tags":["license:mit","size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-05-20T10:46:03.000Z","key":""},{"_id":"6a153f0b85e02991e81333d0","id":"uw-math-ai/math-graph","author":"uw-math-ai","disabled":false,"gated":false,"lastModified":"2026-06-28T21:56:05.000Z","likes":2,"trendingScore":2,"private":false,"sha":"ced4ca9de1bd9e5b67aa09d1d515e270e438fa1e","description":"\n\t\n\t\t\n\t\n\t\n\t\tMath-Graph\n\t\n\nMath-Graph is the dataset behind TheoremGraph, a unified, statement-level dependency\ngraph spanning both informal and formal mathematics. On the informal side it parses millions of\ntheorem-like environments from mathematics arXiv and recovers directed dependency edges within and\nacross papers; on the formal side it releases LeanGraph, an elaborator-level extraction of typed\ndeclaration dependencies across 25 Lean 4 projects. 
The two graphs are bridged into one… See the full description on the dataset page: https://huggingface.co/datasets/uw-math-ai/math-graph.","downloads":75,"tags":["task_categories:text-retrieval","task_categories:feature-extraction","language:en","license:cc-by-4.0","size_categories:10M<n<100M","format:csv","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2606.25363","region:us","mathematics","theorem-proving","lean","arxiv","knowledge-graph","dependency-graph","autoformalization"],"createdAt":"2026-05-26T06:34:51.000Z","key":""},{"_id":"6a1728ec336e9daa66d90494","id":"ServiceNow/PrivacyAlign","author":"ServiceNow","disabled":false,"gated":false,"lastModified":"2026-06-17T20:57:25.000Z","likes":2,"trendingScore":2,"private":false,"sha":"edaf507a2d1019b03b592b92a60d113b7eb46425","description":"\n\t\n\t\t\n\t\n\t\n\t\tPrivacyAlign\n\t\n\nPrivacyAlign is a human-annotated preference dataset for training and evaluating privacy-aligned tool-use agents. Each row pairs two candidate final actions from different models for the same agentic scenario, along with human preference labels and per-response privacy annotations (leaks and omissions).\nThe scenarios are synthetic. 
The user names, emails, memories, and tool trajectories are all generated, and no real user data is included.\n\n\t\n\t\t\n\t\n\t\n\t\tSplits… See the full description on the dataset page: https://huggingface.co/datasets/ServiceNow/PrivacyAlign.","downloads":56,"tags":["language:en","license:apache-2.0","size_categories:1K<n<10K","format:json","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","privacy","tool-use","agents"],"createdAt":"2026-05-27T17:25:00.000Z","key":""},{"_id":"6a17a0e81a02b78b84f8fb52","id":"purvanshi/TASTE","author":"purvanshi","disabled":false,"gated":false,"lastModified":"2026-05-29T11:32:07.000Z","likes":9,"trendingScore":2,"private":false,"sha":"731a7f588d433214c6d864d2e9f47978d91aed6b","description":"\n\t\n\t\t\n\t\n\t\n\t\tTASTE: Human Preferences for Design-Quality Image Comparison\n\t\n\nThis dataset is the human-evaluation corpus released alongside the\nTASTE preference model.  It contains panel rankings of generated\nimages across multiple quality dimensions — both aesthetic (does the\nimage look good?) and description-faithfulness (does the image match\nwhat the prompt describes?) 
— plus a per-image hallucination\njudgement.\n\n\t\n\t\t\n\t\n\t\n\t\tQuick stats\n\t\n\n\n\t\n\t\t\nTable\nRows\nNotes\n\n\n\t\t\nprompts.parquet\n~200\none… See the full description on the dataset page: https://huggingface.co/datasets/purvanshi/TASTE.","downloads":983,"tags":["task_categories:image-classification","task_categories:image-to-text","language:en","license:mit","size_categories:10K<n<100K","format:parquet","modality:image","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","preference-learning","vision-language","design","aesthetics","reward-model"],"createdAt":"2026-05-28T01:56:56.000Z","key":""},{"_id":"6a18b704a8151acebfaaf374","id":"OpenClaw/clawhub-security-signals","author":"OpenClaw","disabled":false,"gated":false,"lastModified":"2026-06-27T01:39:08.000Z","likes":48,"trendingScore":2,"private":false,"sha":"69dcbd323c155312fb000ec89ea0b1efdf6a5757","description":"\n\t\n\t\t\n\t\n\t\n\t\tClawHub Security Signals\n\t\n\n🦀 ClawHub | 📝 OpenClaw Blog | 🤗 Hugging Face Blog | 📄 Paper | 📄 Pre-Print\nClawHub Security Signals is a sanitized, MIT-licensed security-signals dataset for public OpenClaw agent skills. 
It captures how an agent-skill registry evaluates trust, provenance, bundled code, and scanner evidence at scale.\nThis dataset was presented in the paper ClawHub Security Signals: When VirusTotal, Static Analysis, and SkillSpector Disagree.\n\nPaper snapshot: this… See the full description on the dataset page: https://huggingface.co/datasets/OpenClaw/clawhub-security-signals.","downloads":853,"tags":["task_categories:text-classification","task_ids:multi-class-classification","language:en","license:mit","size_categories:10K<n<100K","format:json","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2606.01494","region:us","security","llm-security","agentic-ai","agent-skills","openclaw","clawhub","malware-detection","static-analysis","software-supply-chain","skillspector","owasp","scanner-disagreement","trust-and-safety","mlcroissant"],"createdAt":"2026-05-28T21:43:32.000Z","key":""},{"_id":"6a18f4fcb974a76b6914eecd","id":"cua-lite/GUIAct","author":"cua-lite","disabled":false,"gated":false,"lastModified":"2026-06-18T06:38:44.000Z","likes":2,"trendingScore":2,"private":false,"sha":"1fd3bd1732ae2f21135ee2d5b2757b08ac1d1195","description":"\n\t\n\t\t\n\t\n\t\n\t\tcua-lite/GUIAct\n\t\n\ncua-lite preprocessed version of yiye2023/GUIAct. GUI action data spanning web single-step grounding.action (web-single), web multi-step navigation (web-multi), and Android multi-step navigation (smartphone). 
Coordinates are normalized to [0, 1000]; the upstream test split is honored as the validation split.\n\n\t\n\t\t\n\t\n\t\n\t\tOrigin\n\t\n\n\nhttps://huggingface.co/datasets/yiye2023/GUIAct\n\n\n\t\n\t\t\n\t\n\t\n\t\tLoad via datasets\n\t\n\nfrom datasets import load_dataset\n\n# entire dataset… See the full description on the dataset page: https://huggingface.co/datasets/cua-lite/GUIAct.","downloads":1368,"tags":["task_categories:image-text-to-text","license:other","size_categories:10K<n<100K","format:parquet","format:optimized-parquet","modality:image","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us","cua-lite","gui","sft"],"createdAt":"2026-05-29T02:07:56.000Z","key":""},{"_id":"6a1b276fcff89db09a24e6af","id":"SWE-Lego/Terminal-Lego-15k","author":"SWE-Lego","disabled":false,"gated":false,"lastModified":"2026-06-04T09:12:02.000Z","likes":4,"trendingScore":2,"private":false,"sha":"9c197f1c2e87b64cc316b1a5bfcef57b584929f0","description":"\n\t\n\t\t\n\t\n\t\n\t\tTerminal-Lego-15k: Docker-Verified Terminal Agent Tasks from Real StackOverflow Issues\n\t\n\nProject Page · Code · Paper · Models\nTerminal-Lego-15k is a large-scale collection of Docker-verified, Terminal-Bench-style agentic tasks built from real StackOverflow technical issues. 
Each task is constructed through the Terminal-Lego pipeline: StackOverflow questions are filtered, converted via cascaded LLM generation, and retained only after Docker round-trip verification.\nThis dataset is… See the full description on the dataset page: https://huggingface.co/datasets/SWE-Lego/Terminal-Lego-15k.","downloads":9375,"tags":["task_categories:question-answering","language:en","license:apache-2.0","size_categories:10K<n<100K","arxiv:2606.03461","region:us","terminal","code-agent","stackoverflow","docker","benchmark"],"createdAt":"2026-05-30T18:07:43.000Z","key":""},{"_id":"6a1d9f3d7e7f1d1a9a4c2416","id":"Artificio/robusto-2","author":"Artificio","disabled":false,"gated":false,"lastModified":"2026-06-24T19:19:45.000Z","likes":2,"trendingScore":2,"private":false,"sha":"db72f46dc648356a2c4768c463b732397063169e","description":"\n\t\n\t\t\n\t\n\t\n\t\tDataset: Robusto-2\n\t\n\nPaper Link on ArXiv: https://arxiv.org/abs/2606.20980\n\n\t\n\t\t\n\t\n\t\n\t\tDescription\n\t\n\nThis dataset contains 20 videos, which were specifically used in this paper. These videos were selected from a larger set of 200 dashcam videos recorded in various cities across Peru (Lima) and New York City (NYC), available as an extended dataset. 
They are split evenly by region — 10 from Lima/Peru and 10 from NYC — so model and human behavior can be compared across a familiar… See the full description on the dataset page: https://huggingface.co/datasets/Artificio/robusto-2.","downloads":155,"tags":["task_categories:visual-question-answering","task_categories:video-classification","task_categories:video-text-to-text","language:en","license:cc-by-nc-4.0","modality:video","arxiv:2606.20980","region:us","arxiv:2606.20980"],"createdAt":"2026-06-01T15:03:25.000Z","key":""},{"_id":"6a1ee2ed0cee96c6fafce91b","id":"sepiq-2026/SEPIQ-2026-training-data","author":"sepiq-2026","disabled":false,"gated":"manual","lastModified":"2026-06-15T16:13:35.000Z","likes":6,"trendingScore":2,"private":false,"sha":"9ab83abe5117affc345b659d5204acc65aedff14","description":"\n\n\n\t\n\t\t\n\t\n\t\n\t\tSEPIQ 2026 Training Data\n\t\n\nSequence-Based Epitope Prediction Intelligent Query\nGated training data repository for the SEPIQ 2026 Challenge.\n\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tOverview\n\t\n\nSEPIQ 2026 is a community challenge for benchmarking computational methods that predict antibody–antigen interaction sites from VHH nanobody sequence data.\nThis repository will provide access to the training data approved for release to accepted SEPIQ Challenge participants. 
Access is gated so that the organizing… See the full description on the dataset page: https://huggingface.co/datasets/sepiq-2026/SEPIQ-2026-training-data.","downloads":136,"tags":["license:other","size_categories:1M<n<10M","format:csv","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","biology","antibodies","VHH","nanobody","epitope-prediction","protein-structure","challenge"],"createdAt":"2026-06-02T14:04:29.000Z","key":""},{"_id":"6a1f33285b70251eb1f1ef8b","id":"Yale-BIDS-Chen/medpmc-11m-dataset_jun24_baseline","author":"Yale-BIDS-Chen","disabled":false,"gated":false,"lastModified":"2026-06-05T13:44:59.000Z","likes":2,"trendingScore":2,"private":false,"sha":"cd8d31705bf52ea9588d56d62b3d260a56b2a2c1","description":"\n\t\n\t\t\n\t\n\t\n\t\tMedPMC WebDataset\n\t\n\nMedPMC is a large-scale medical image-text dataset curated from articles in the PubMed Central (PMC) collection. This release contains approximately 11 million image-text pairs collected from the June 2024 PMC baseline. 
MedPMC is an ongoing effort, and future releases will continue to expand the dataset with newly published literature, improved annotations, and additional resources.\nCompared with raw PMC resources, MedPMC introduces two major improvements.\n(1)… See the full description on the dataset page: https://huggingface.co/datasets/Yale-BIDS-Chen/medpmc-11m-dataset_jun24_baseline.","downloads":1539,"tags":["license:cc-by-nc-sa-4.0","size_categories:1M<n<10M","format:webdataset","modality:image","modality:text","library:datasets","library:webdataset","library:mlcroissant","region:us"],"createdAt":"2026-06-02T19:46:48.000Z","key":""},{"_id":"6a1fcf78aa35c86b3f238eb0","id":"AgentCyberRange/WebExploitBench","author":"AgentCyberRange","disabled":false,"gated":"manual","lastModified":"2026-06-16T12:12:59.000Z","likes":3,"trendingScore":2,"private":false,"sha":"0d49548b717018822d988d1fe770def0f65541b9","description":"\n\t\n\t\t\n\t\n\t\n\t\tWebExploitBench\n\t\n\nWebExploitBench is the web exploitation benchmark under AgentCyberRange.\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tOverview\n\t\n\nWebExploitBench is a benchmark for evaluating real-world web exploitation in isolated targets. The complete dataset contains 15 self-contained web applications with 110 documented vulnerabilities.\nEach application includes the target environment, exploits, reports, and verifiers. 
WebExploitBench contains 0-day, 1-day, and synthetic vulnerabilities, covering common… See the full description on the dataset page: https://huggingface.co/datasets/AgentCyberRange/WebExploitBench.","downloads":113,"tags":["license:apache-2.0","arxiv:2606.14295","region:us","benchmark"],"createdAt":"2026-06-03T06:53:44.000Z","key":""},{"_id":"6a1fcfc805354154af952a53","id":"AgentCyberRange/PostExploitBench","author":"AgentCyberRange","disabled":false,"gated":"manual","lastModified":"2026-06-16T12:12:15.000Z","likes":4,"trendingScore":2,"private":false,"sha":"fc477ae1960f49ec8416da97785aa9b3ffc55cc6","description":"\n\t\n\t\t\n\t\n\t\n\t\tPostExploitBench\n\t\n\nPostExploitBench is the post exploitation benchmark under AgentCyberRange.\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tOverview\n\t\n\nPostExploitBench is a benchmark for multi-host post-exploitation tasks. The complete dataset contains 8 self-contained cyber ranges with 156 target hosts.\nEach range-N/ models an isolated enterprise-like cyber range with entry services, internal networks, pivot hosts, vulnerable targets, supporting services, and decoys for evaluating multi-stage compromise.… See the full description on the dataset page: https://huggingface.co/datasets/AgentCyberRange/PostExploitBench.","downloads":57,"tags":["license:apache-2.0","arxiv:2606.14295","region:us","benchmark","cyber range"],"createdAt":"2026-06-03T06:55:04.000Z","key":""},{"_id":"6a2022ef86c5a136888c3a9c","id":"deem-data/ArtiFact","author":"deem-data","disabled":false,"gated":false,"lastModified":"2026-06-08T13:32:22.000Z","likes":5,"trendingScore":2,"private":false,"sha":"020c62d543a4c884c08e181842c0c83468db6bb7","description":"\n\t\n\t\t\n\t\n\t\n\t\tArtiFact\n\t\n\nArtiFact is a large-scale multimodal benchmark of museum artwork records with aligned images and structured metadata. 
It is designed for evaluating metadata extraction, error detection, semantic querying, and multimodal reasoning over cultural-heritage collections.\nThe dataset combines records from the Rijksmuseum, the Metropolitan Museum of Art (Met), and the Art Institute of Chicago (AIC), with normalized fields for artists, dates, materials, techniques, dimensions… See the full description on the dataset page: https://huggingface.co/datasets/deem-data/ArtiFact.","downloads":355,"tags":["task_categories:image-classification","task_categories:text-classification","task_categories:question-answering","task_categories:object-detection","language:en","language:nl","license:cc-by-2.0","size_categories:100K<n<1M","modality:image","modality:text","region:us","art","museum","multimodal","data-cleaning","error-detection","cultural-heritage","table","image","text"],"createdAt":"2026-06-03T12:49:51.000Z","key":""},{"_id":"6a2109cc17dad4ca26c80eb7","id":"MINT-SJTU/RW-RL-Dataset","author":"MINT-SJTU","disabled":false,"gated":false,"lastModified":"2026-06-30T03:09:48.000Z","likes":8,"trendingScore":2,"private":false,"sha":"db8d5513c4917ecc64be9409c4deb2d3c7663161","description":"\n\t\n\t\t\n\t\n\t\n\t\tRW-RL Dataset: Real-World Reinforcement Learning for Robots\n\t\n\n\n  \n    \n  \n  \n    \n  \n  \n  \n\n\nRW-RL Dataset is a real-world robot interaction dataset released by Boden Intelligence, Junpu Innovation Center, and the MINT Lab at Shanghai Jiao Tong University. 
It is designed for a bottleneck that imitation-learning-only datasets do not directly solve: how can robot policies keep improving in the physical world after they leave the ideal demonstration trajectory?\nThe broader RW-RL data… See the full description on the dataset page: https://huggingface.co/datasets/MINT-SJTU/RW-RL-Dataset.","downloads":6356,"tags":["language:en","size_categories:10K<n<100K","modality:video","region:us","robotics","reinforcement-learning","real-world-robotics","embodied-ai","teleoperation","human-intervention","multimodal"],"createdAt":"2026-06-04T05:14:52.000Z","key":""},{"_id":"6a22b7f71e915a961d8d531d","id":"ARTPARK-IISc/Vaani-Benchmark-V1.0","author":"ARTPARK-IISc","disabled":false,"gated":"auto","lastModified":"2026-06-29T06:56:12.000Z","likes":4,"trendingScore":2,"private":false,"sha":"1bf019521d12d742178acc32bf2a42f81cf7c8ef","description":"\n\t\n\t\t\n\t\n\t\n\t\tVaani-Benchmark-V1.0\n\t\n\nA curated Hindi ASR evaluation set drawn from the Vaani project. 
This benchmark contains 5,050 audio segments from 1,103 speakers across 104 Indian districts, each with three independent human transcriptions.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Summary\n\t\n\n\n\t\n\t\t\nProperty\nValue\n\n\n\t\t\nLanguage\nHindi (with code-switching)\n\n\nSegments\n5,050\n\n\nSpeakers\n1,103\n\n\nDuration\n~10.9 hours\n\n\nDistricts\n104 across 16 Indian states\n\n\nTranscriptions per segment\n3 (independent human annotators)… See the full description on the dataset page: https://huggingface.co/datasets/ARTPARK-IISc/Vaani-Benchmark-V1.0.","downloads":197,"tags":["benchmark:official","benchmark:eval-yaml","task_categories:automatic-speech-recognition","source_datasets:ARTPARK-IISc/Vaani","language:hi","license:cc-by-4.0","size_categories:1K<n<10K","format:parquet","modality:audio","modality:image","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2603.28714","region:us","hindi","speech","asr","evaluation","india","multilingual","code-switching"],"createdAt":"2026-06-05T11:50:15.000Z","key":""},{"_id":"6a23971aed8b6eeac4e4fef0","id":"GenAI4ELab/papercli-papers","author":"GenAI4ELab","disabled":false,"gated":false,"lastModified":"2026-06-20T18:14:37.000Z","likes":17,"trendingScore":2,"private":false,"sha":"90a1fbd3c355717092966debf5f7f69bdc6a1cf6","description":"\n\t\n\t\t\n\t\n\t\n\t\tAI Conference & Journal Papers\n\t\n\nSearchable metadata and full-text PDF mirrors for papers from top-tier AI venues (NeurIPS, ICML, ICLR, CVPR, ICCV, ECCV, WACV, ACL, EMNLP, NAACL, IJCAI, AAAI, JMLR, Interspeech) from 2023. 
\n\n📊 papers.parquet: The complete dataset containing all fields and all venues.\n🔍 Per-venue browse views: Easily explore specific subsets by selecting a venue in Subset and a year in Split.\n\n\n\n\t\n\t\t\n\t\n\t\n\t\t🏗️ Dataset Structure & Storage Strategy\n\t\n\nTo avoid… See the full description on the dataset page: https://huggingface.co/datasets/GenAI4ELab/papercli-papers.","downloads":10772,"tags":["license:cc-by-4.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-06-06T03:42:18.000Z","key":""},{"_id":"6a23f70b43d4c2ecb44f33ad","id":"xlangai/osworld2.0-trajectory","author":"xlangai","disabled":false,"gated":false,"lastModified":"2026-06-24T03:04:21.000Z","likes":5,"trendingScore":2,"private":false,"sha":"b2d4e7b9f2b842b64433c1af526b36c272d27fe6","downloads":23217,"tags":["size_categories:n<1K","modality:image","region:us"],"createdAt":"2026-06-06T10:31:39.000Z","key":""},{"_id":"6a2633430a09d21c4bee37a6","id":"build-small-hackathon/jawbreaker-scam-defense-data","author":"build-small-hackathon","disabled":false,"gated":false,"lastModified":"2026-06-10T00:33:28.000Z","likes":6,"trendingScore":2,"private":false,"sha":"e140767eec5efaad223256faf88784b7c6378ecd","description":"\n\t\n\t\t\n\t\n\t\n\t\tJawbreaker Scam Defense Data\n\t\n\nSynthetic and sanitized training/eval data for Jawbreaker, a local-first scam defense app for someone you love.\nJawbreaker turns a suspicious text, email, or DM into a plain-English safety card: the risk, the warning signs, and the safest next step before someone replies, clicks, or pays.\n\n\t\n\t\t\n\t\n\t\n\t\tContents\n\t\n\n\neval/: scam-defense evaluation sets from smoke checks through hard calibration suites.\neval/reports/: guarded evaluation reports for the… See the full description on the dataset page: 
https://huggingface.co/datasets/build-small-hackathon/jawbreaker-scam-defense-data.","downloads":201,"tags":["task_categories:text-classification","task_categories:text-generation","language:en","license:mit","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","build-small-hackathon","scam-defense","safety-eval","synthetic-data","instruction-tuning","minicpm5","local-first","jawbreaker"],"createdAt":"2026-06-08T03:13:07.000Z","key":""},{"_id":"6a2645668f44a8ad13feb407","id":"cuhk-zhuque/SkillTrustBench","author":"cuhk-zhuque","disabled":false,"gated":false,"lastModified":"2026-06-15T06:06:16.000Z","likes":8,"trendingScore":2,"private":false,"sha":"762d5388b3a047b26df9679582af868a0e5b2c8f","description":"\n\t\n\t\t\n\t\n\t\n\t\tSkillTrustBench\n\t\n\nSkillTrustBench is a benchmark dataset for evaluating security analysis of agent skills: reusable capability packages that extend an AI agent through natural-language instructions, tool-use guidance, and optional executable or reference assets. 
Each case follows an agent-skill-style layout, with a SKILL.md entrypoint that defines when and how the skill should be used, plus optional scripts, references, assets, configuration files, or agent definitions.\nThe… See the full description on the dataset page: https://huggingface.co/datasets/cuhk-zhuque/SkillTrustBench.","downloads":580,"tags":["task_categories:text-classification","task_ids:multi-class-classification","annotations_creators:expert-generated","language_creators:found","language_creators:machine-generated","multilinguality:multilingual","source_datasets:original","language:en","language:zh","license:other","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","cybersecurity","ai-safety","agent-safety","benchmark","code","static-analysis","malware-detection","prompt-injection","supply-chain-security","red-teaming","tool-use","skills"],"createdAt":"2026-06-08T04:30:30.000Z","key":""},{"_id":"6a283cfae1279a2eec599915","id":"UKPLab/agentcibench","author":"UKPLab","disabled":false,"gated":false,"lastModified":"2026-06-24T22:14:04.000Z","likes":2,"trendingScore":2,"private":false,"sha":"f5523cb986828ee77bc505120a933f114bbcf4f3","description":"\n\t\n\t\t\n\t\n\t\n\t\tAgentCIBench\n\t\n\nCapable but Careless: Do Computer-Use Agents Follow Contextual Integrity?\nAgentCIBench measures whether computer-use agents (CUAs) respect\ncontextual integrity (CI) when operating across personal applications.\nEach scenario is an executable, deterministically scored snapshot of a\nmulti-app workspace, paired with an under-specified user request and a\nground-truth disclosure policy (must_share / must_not_share).\n\n📄 Paper: arXiv:2606.23189\n💻 Code:… See the full description on the dataset page: 
https://huggingface.co/datasets/UKPLab/agentcibench.","downloads":92,"tags":["task_categories:other","language:en","license:cc-by-4.0","size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2606.23189","region:us","contextual-integrity","privacy","computer-use-agents","llm-evaluation","benchmark","agent-safety"],"createdAt":"2026-06-09T16:19:06.000Z","key":""},{"_id":"6a29127ae3916e42f22a9bd0","id":"SolimanBa/Tech-Career-Paths-Dataset","author":"SolimanBa","disabled":false,"gated":false,"lastModified":"2026-06-10T07:33:57.000Z","likes":2,"trendingScore":2,"private":false,"sha":"39ee96605cce3254d1bceeb474daaa1b189a0814","downloads":53,"tags":["license:apache-2.0","size_categories:1K<n<10K","format:csv","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-06-10T07:30:02.000Z","key":""},{"_id":"6a294060470b7ac939ed241b","id":"victor/fable-5-boeing-747-trace","author":"victor","disabled":false,"gated":false,"lastModified":"2026-06-11T20:13:15.000Z","likes":30,"trendingScore":2,"private":false,"sha":"e146afb46a99b3873a1a61e12454ba3cd2fff299","description":"\n\t\n\t\t\n\t\n\t\n\t\tFable 5 Boeing 747: Claude Code session trace\n\t\n\nThe full Claude Code (Fable 5) session transcript that built victor/fable-5-boeing-747, a procedural Boeing 747 in Three.js, from a single /goal prompt:\n\ncreate the most realistic boeing 747 using THREEJS - use your vision capabilities to create a self verifiable system, enter a loop until you are 100% satisfied about the result (you can build a camera system to inspect each angle)\n\nThe agent built the model, set up a headless… See the full description on the dataset page: 
https://huggingface.co/datasets/victor/fable-5-boeing-747-trace.","downloads":1525,"tags":["license:mit","size_categories:n<1K","format:json","format:agent-traces","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","agent-traces","claude-code","threejs","fable-5"],"createdAt":"2026-06-10T10:45:52.000Z","key":""},{"_id":"6a2a5f5f2ef38e1f849a8ebf","id":"tencent/Hy-Embodied-0.5-VLA-Data","author":"tencent","disabled":false,"gated":false,"lastModified":"2026-06-25T09:43:40.000Z","likes":14,"trendingScore":2,"private":false,"sha":"3c1e9fa080d2aa17d55651c98e6a9139e5a6936c","description":"\nHy-Embodied-0.5-VLA\nFrom Vision-Language-Action Models to a Real-World Robot Learning Stack\nTencent Robotics X × Tencent Hy Team\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\t\n\t\t\n\t\n\t\n\t\t📖 Abstract\n\t\n\nWe introduce Hy-Embodied-0.5-VLA (Hy-VLA) — an end-to-end Vision-Language-Action system that spans the full robot learning stack: data collection, model design, pre-training, supervised fine-tuning, RL post-training, and real-world deployment. 
Built on the Hy-Embodied-0.5 MoT backbone, Hy-VLA integrates a flow-matching… See the full description on the dataset page: https://huggingface.co/datasets/tencent/Hy-Embodied-0.5-VLA-Data.","downloads":160449,"tags":["task_categories:robotics","task_categories:reinforcement-learning","license:cc-by-4.0","size_categories:n<1K","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","library:lerobot","library:lance","arxiv:2606.14409","region:us","robotics","manipulation","bimanual","VLA","leRobot","lance","imitation-learning"],"createdAt":"2026-06-11T07:10:23.000Z","key":""},{"_id":"6a2b051031a20563f82dcada","id":"trace-commons/agent-traces","author":"trace-commons","disabled":false,"gated":false,"lastModified":"2026-06-18T06:23:00.000Z","likes":23,"trendingScore":2,"private":false,"sha":"112ebd4d03ce852b00e935d523107c3d0c9a65bf","description":"\n\t\n\t\t\n\t\n\t\n\t\tTrace Commons — Agent Traces\n\t\n\nTrace Commons is one open, public dataset of coding-agent sessions — the\nback-and-forth between a developer and an AI coding agent, including prompts,\nmodel responses, tool calls, and command output — contributed voluntarily as an\nopen resource for studying, evaluating, and building on how these agents\nactually work.\nEvery trace here was donated only from a public, open-source repository, was\nanonymized on the contributor's own machine before upload… See the full description on the dataset page: 
https://huggingface.co/datasets/trace-commons/agent-traces.","downloads":1170,"tags":["task_categories:text-generation","language:en","license:cc-by-4.0","size_categories:n<1K","format:parquet","format:optimized-parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","agent","agent-traces","coding-agent","traces","tool-use","open-data"],"createdAt":"2026-06-11T18:57:20.000Z","key":""},{"_id":"6a2b3987b36bc04def6fa18e","id":"zekaiwang/trex_dataset","author":"zekaiwang","disabled":false,"gated":false,"lastModified":"2026-06-16T09:09:32.000Z","likes":6,"trendingScore":2,"private":false,"sha":"bf0eb24c4b8bdd95752b553f0fc50e46a22f1cc8","description":"\n\t\n\t\t\n\t\n\t\n\t\tT-Rex Dataset\n\t\n\nA large-scale, tactile-reactive bimanual manipulation dataset, collected via teleoperation on a\nDexmate Vega-1 robot with two Sharpa Wave dexterous hands. Stored as a\nLeRobotDataset v3.0.\n🌐 Project Page · ✍️ Paper (arXiv) · 💻 Code (T-Rex) · 🚀 Dataset Quickstart · 📓 Colab notebook\n\n  \n  \n  One episode from each of 20 motor primitives (head-camera view, cropped to the workspace), each with a different object.\n\n\n\n  \n  \n  Teleoperation setup: Manus gloves + VIVE… See the full description on the dataset page: 
https://huggingface.co/datasets/zekaiwang/trex_dataset.","downloads":140013,"tags":["task_categories:robotics","language:en","license:mit","size_categories:1M<n<10M","format:parquet","modality:tabular","modality:text","modality:timeseries","modality:video","library:datasets","library:dask","library:polars","library:mlcroissant","library:lerobot","arxiv:2606.17055","region:us","LeRobot","robotics","manipulation","tactile","bimanual","dexterous-manipulation"],"createdAt":"2026-06-11T22:41:11.000Z","key":""},{"_id":"6a2c197e671ba44c16fb8e82","id":"ratschlab/TCGA_virtual_spatial_transcriptomics_atlas","author":"ratschlab","disabled":false,"gated":"manual","lastModified":"2026-06-30T06:30:25.000Z","likes":3,"trendingScore":2,"private":false,"sha":"fb1fe2515f33098a8d9d0081ae4a05f8b3c98970","description":"\n\t\n\t\t\n\t\n\t\n\t\tTCGA digital spatial transcriptomics atlas\n\t\n\nThis repository contains predicted spatial transcriptomics for TCGA H&E slides,\nboth fresh-frozen (FF) and formalin-fixed paraffin-embedded (FFPE), produced\nwith DeepSpot-M.\nAuthors: Kalin Nonchev, Sebastian Dawo, Karina Silina, Viktor Hendrik\nKoelzer, and Gunnar Rätsch.\nPaper: DeepSpot-M: a multimodal foundation model for transcriptome-wide virtual spatial transcriptomics from histology (medRxiv, 2026; see the citation below).\nCode:… See the full description on the dataset page: 
https://huggingface.co/datasets/ratschlab/TCGA_virtual_spatial_transcriptomics_atlas.","downloads":174,"tags":["language:en","license:cc-by-nc-sa-4.0","size_categories:n<1K","format:imagefolder","modality:image","library:datasets","library:mlcroissant","region:us","spatial-transcriptomics","histology","pathology","transcriptomics","machine-learning","TCGA","computational-pathology","foundation-model","multimodal","virtual-spatial-transcriptomics","whole-slide-imaging","oncology","cancer","gene-expression","deep-learning","DeepSpot-M"],"createdAt":"2026-06-12T14:36:46.000Z","key":""},{"_id":"6a306d909f9520d026db14cf","id":"crosbylegal/RedlineBench","author":"crosbylegal","disabled":false,"gated":false,"lastModified":"2026-06-18T14:40:28.000Z","likes":14,"trendingScore":2,"private":false,"sha":"eee1b6790982ed1279e86bec7616b662a61993e6","description":"\n  \n\n\n\n\t\n\t\t\n\t\n\t\n\t\tAbstract\n\t\n\nCrosby–micro1 RedlineBench measures contract negotiation\nas a sequence of judgment calls rather than a collection of isolated clause edits.\nIt captures multi-turn redlining workflows through simulations grounded in\nrealistic SaaS transactions and attorney-generated explanations of key redline\ndecisions, and evaluates models across five dimensions: legal correctness,\ncommercial alignment, negotiation quality, counterparty-acceptance prediction, and\ndeal-closing… See the full description on the dataset page: 
https://huggingface.co/datasets/crosbylegal/RedlineBench.","downloads":4569,"tags":["benchmark:official","benchmark:eval-yaml","task_categories:text-generation","annotations_creators:expert-generated","language_creators:expert-generated","multilinguality:monolingual","source_datasets:original","language:en","license:cc-by-4.0","size_categories:n<1K","format:parquet","modality:document","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","legal","negotiation","benchmark"],"createdAt":"2026-06-15T21:24:32.000Z","key":""},{"_id":"6a313297671ba44c169b69c0","id":"HKUSTAudio/ISCSLP2026-CoT-TTS","author":"HKUSTAudio","disabled":false,"gated":false,"lastModified":"2026-06-30T02:53:20.000Z","likes":10,"trendingScore":2,"private":false,"sha":"1423344c3cab0852cf840d6c93c976efcebac0fa","description":"\n\t\n\t\t\n\t\n\t\n\t\tISCSLP 2026 CoT-TTS Dataset\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Overview\n\t\n\nThis dataset is prepared for the ISCSLP 2026 CoT-TTS Challenge and is designed to support research on context-aware, expressive, and CoT-guided speech generation. It is constructed from speech-rich media sources, including films, TV dramas, radio dramas, and short dramas, where dialogue often contains rich conversational context, speaker interactions, scene changes, and emotional variation. 
Each sample is organized… See the full description on the dataset page: https://huggingface.co/datasets/HKUSTAudio/ISCSLP2026-CoT-TTS.","downloads":9150,"tags":["arxiv:2606.21933","region:us"],"createdAt":"2026-06-16T11:25:11.000Z","key":""},{"_id":"6a31fb7d840df2d57f83c572","id":"nvidia/Nemotron-Personas-Belgium","author":"nvidia","disabled":false,"gated":false,"lastModified":"2026-06-17T05:12:10.000Z","likes":33,"trendingScore":2,"private":false,"sha":"b13368c38c5667c9b8b035accaf0d2b3298b38b3","description":"\n\t\n\t\t\n\t\n\t\n\t\tNemotron-Personas-Belgium\n\t\n\n\n  \n  \n    (NL) Een compound-AI-benadering van meertalige Belgische persona's, verankerd in reële verdelingen\n    (FR) Une approche d'IA composée pour des personas belges multilingues, ancrés dans des distributions réelles\n    (DE) Ein Compound-KI-Ansatz für mehrsprachige belgische Personas, verankert in realen Verteilungen\n    (EN) A compound AI approach to multilingual Belgian personas grounded in real-world distributions\n  \n\n\n\n\t\n\t\t\n\t\n\t\n\t\tOverzicht… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-Personas-Belgium.","downloads":2335,"tags":["task_categories:text-generation","language:nl","language:fr","language:de","language:en","license:cc-by-4.0","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","library:datadesigner","region:us","synthetic","personas","NVIDIA","datadesigner","belgium","Dutch","French","German","English"],"createdAt":"2026-06-17T01:42:21.000Z","key":""},{"_id":"6a32a0b2f45a75defedc637f","id":"orena-dkfz/lapchole-focus-vqa","author":"orena-dkfz","disabled":false,"gated":"manual","lastModified":"2026-06-22T22:25:00.000Z","likes":2,"trendingScore":2,"private":false,"sha":"fe18141af87ac71f1cb95f94db705b4bc1bdb073","description":"\n\n\n\t\n\t\t\n\t\n\t\n\t\tLapChole-FOCUS-VQA\n\t\n\nA clinically grounded benchmark for long-context video 
understanding in minimally invasive surgery.\n💻 Code &nbsp;•&nbsp; 🏆 Challenge &nbsp;•&nbsp; ⚖️ Data Usage Agreement\n\n\n\n\n[!IMPORTANT]\n\n\t\n\t\t\n\t\n\t\n\t\t🔒 This is a gated dataset\n\t\n\nAccess is granted only to participants of the ORena FOCUS Challenge and is subject to manual review. To be approved you must:\n\nHave a Hugging Face account and be logged in — downloads are only enabled for registered, authenticated… See the full description on the dataset page: https://huggingface.co/datasets/orena-dkfz/lapchole-focus-vqa.","downloads":629,"tags":["task_categories:visual-question-answering","language:en","license:other","size_categories:10K<n<100K","format:parquet","modality:text","modality:video","library:datasets","library:dask","library:polars","library:mlcroissant","region:us","medical","surgical","video-understanding","laparoscopy","cholecystectomy","foreign-objects"],"createdAt":"2026-06-17T13:27:14.000Z","key":""},{"_id":"6a32ba24a0e8b124621ea973","id":"robotcom/R-KNav_dataset","author":"robotcom","disabled":false,"gated":"manual","lastModified":"2026-06-25T15:46:12.000Z","likes":2,"trendingScore":2,"private":false,"sha":"7d20dab3888d48cd720d05e876086f79123c8f4a","description":"R-KNav Dataset\n\n\n  \n  \n\n\n\n\t\n\t\t\n\t\n\t\n\t\tMotivation\n\t\n\nThe transition of robotics from rigid, programmed behaviors to AI-driven autonomy requires advanced foundation models capable of learning from complex datasets. To catalyze this evolution, Robot.com has developed the R-KNav dataset as a resource for the global AI community. 
Derived from real-world operations of the R-Kiwi sidewalk rover fleet across 15 US locations, this initiative captures 10,000 hours of multimodal data in the Lerobot v3.0… See the full description on the dataset page: https://huggingface.co/datasets/robotcom/R-KNav_dataset.","downloads":66,"tags":["language:en","license:other","size_categories:1K<n<10K","modality:video","library:datasets","library:mlcroissant","library:lerobot","doi:10.57967/hf/9276","region:us","robotics","lerobot","navigation"],"createdAt":"2026-06-17T15:15:48.000Z","key":""},{"_id":"6a33eb86c17c0c5dc3265ea9","id":"Venn2024/MedStreamBench","author":"Venn2024","disabled":false,"gated":false,"lastModified":"2026-06-29T06:18:08.000Z","likes":2,"trendingScore":2,"private":false,"sha":"1486f0fb8d630eb7b3129a40661f59cd7dda87d1","description":"\n\t\n\t\t\n\t\n\t\n\t\tMedStreamBench\n\t\n\nThis directory contains dataset-scoped JSONL benchmark files for MedStreamBench, a time-aware benchmark for medical video question answering under evidence-constrained settings.\n\n\t\n\t\t\n\t\n\t\n\t\tContents\n\t\n\nEach .jsonl file corresponds to one source dataset integrated into MedStreamBench. 
The current files are:\n\nAVOS.jsonl\nAlxSuture.jsonl\nAutoLapora.jsonl\nBernBypass70.jsonl\nCVC-ClinicDB.jsonl\nCholecT45.jsonl\nCholecT50.jsonl\nColonoscopic-addi.jsonl… See the full description on the dataset page: https://huggingface.co/datasets/Venn2024/MedStreamBench.","downloads":42,"tags":["license:apache-2.0","region:us"],"createdAt":"2026-06-18T12:58:46.000Z","key":""},{"_id":"6a3415efad622e1058b63a8f","id":"slprl/common-sense-facts-audio","author":"slprl","disabled":false,"gated":false,"lastModified":"2026-06-25T09:10:34.000Z","likes":2,"trendingScore":2,"private":false,"sha":"f0038948b4053b8c83e6b67133112742f72890bb","description":"\n\t\n\t\t\n\t\n\t\n\t\tCommon-Sense Facts Audio Dataset\n\t\n\nA spoken fact-completion dataset for evaluating whether Speech Language Models can retrieve common-sense and factual knowledge from speech.\nEach example contains three paired versions:\n\nprompt: an incomplete factual prompt, e.g. \"the capital of France is\"\nfact: the correct full sentence, e.g. \"the capital of France is Paris\"\ncounterfactual: an incorrect matched sentence from the same category, e.g. 
\"the capital of France is Rome\"\n\nThe dataset… See the full description on the dataset page: https://huggingface.co/datasets/slprl/common-sense-facts-audio.","downloads":80,"tags":["task_categories:automatic-speech-recognition","language:en","license:other","size_categories:n<1K","format:parquet","modality:audio","modality:text","modality:timeseries","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2606.22473","region:us","speech","speech-language-models","spoken-language-understanding","common-sense","factual-knowledge","factual-recall","likelihood-evaluation","audio-reasoning","interpretability"],"createdAt":"2026-06-18T15:59:43.000Z","key":""},{"_id":"6a344e1cb5e7d1bd7428c9a6","id":"WithinUsAI/fable_5_distillation_merged_cleaned_25k","author":"WithinUsAI","disabled":false,"gated":false,"lastModified":"2026-06-18T20:13:06.000Z","likes":2,"trendingScore":2,"private":false,"sha":"86246165324ac630e37a3cfcf163d94bf425f295","description":"\n\t\n\t\t\n\t\n\t\n\t\tClaude Fable 5 Distillation Dataset\n\t\n\n25,719 high-quality distilled examples for training LLMs to mimic Claude Fable 5's reasoning style — featuring multi-step chain-of-thought with <think> tags across 23+ technical domains.\nThis dataset captures the distinctive reasoning patterns of Claude Fable 5 (Anthropic's Mythos-class model released June 2026): systematic decomposition, first-principles analysis, self-verification, alternative consideration, and synthesis.… See the full description on the dataset page: 
https://huggingface.co/datasets/WithinUsAI/fable_5_distillation_merged_cleaned_25k.","downloads":175,"tags":["license:apache-2.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","fable5","mythos","claude","opus4.8"],"createdAt":"2026-06-18T19:59:24.000Z","key":""},{"_id":"6a34b55dc2d3795a1c19e473","id":"Junle-cs/trip-plus-database","author":"Junle-cs","disabled":false,"gated":false,"lastModified":"2026-06-19T03:31:33.000Z","likes":2,"trendingScore":2,"private":false,"sha":"274878509b8f3a7eab7ef97c781e55247effa3b9","downloads":1417,"tags":["region:us"],"createdAt":"2026-06-19T03:19:57.000Z","key":""},{"_id":"6a34d5c05dfc204fc70cd95e","id":"sediment1024/PhysRAG","author":"sediment1024","disabled":false,"gated":false,"lastModified":"2026-06-26T03:12:15.000Z","likes":2,"trendingScore":2,"private":false,"sha":"72ed62ffb81271a62f2ea83736f6bf4dad592e2e","description":"\n\t\n\t\t\n\t\n\t\n\t\tPhysRAG Curated Physical Video Dataset\n\t\n\n\nAuthors: Kexu Cheng, Zicheng Liu, Mingju Gao, Chunhe Song, Hao TangProject and code: https://github.com/sediment1024/PhysRAGPaper: PhysRAG: Enhancing Physics-Awareness in Video Generation via Retrieval-Augmented GenerationPaper URL: arXiv:2606.26916\n\nThis dataset contains 6,869 curated physical-dynamics videos and English text\nprompts derived from WISA-80K. 
It also marks a 170-video reference subset used by\nthe PhyRAG retrieval database.… See the full description on the dataset page: https://huggingface.co/datasets/sediment1024/PhysRAG.","downloads":306,"tags":["task_categories:text-to-video","language:en","license:other","modality:text","arxiv:2606.26916","arxiv:2503.08153","region:us","video-generation","physics","retrieval-augmented-generation"],"createdAt":"2026-06-19T05:38:08.000Z","key":""},{"_id":"6a36430504d9ed1211305fcb","id":"JiayuJeff/PlanBench-XL","author":"JiayuJeff","disabled":false,"gated":false,"lastModified":"2026-06-28T16:43:38.000Z","likes":4,"trendingScore":2,"private":false,"sha":"8d8d3d1a6fa7954a756193cf0f142c875a769b80","description":"\n\t\n\t\t\n\t\n\t\n\t\tPlanBench-XL\n\t\n\nPlanBench-XL is a benchmark dataset for evaluating long-horizon tool-use and planning agents on structured information retrieval tasks. Each example is a natural-language query that asks an agent to trace linked records and return a specific target field.\n\nProject page: https://planbench-xl.github.io/\nGitHub: JiayuJeff/PlanBench-XL\nPaper: https://arxiv.org/pdf/2606.22388\n\nThe current release contains 327 query examples in:\nPlanBench-XL_queries.json… See the full description on the dataset page: https://huggingface.co/datasets/JiayuJeff/PlanBench-XL.","downloads":101,"tags":["task_categories:question-answering","task_categories:text-generation","language:en","license:mit","size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2606.22388","region:us"],"createdAt":"2026-06-20T07:36:37.000Z","key":""},{"_id":"6a384bd57251293c9f063fdd","id":"mbaye930/WolofEntityLinking","author":"mbaye930","disabled":false,"gated":false,"lastModified":"2026-06-26T13:17:57.000Z","likes":2,"trendingScore":2,"private":false,"sha":"3d40cab0541f7c6d91ac9bd8fe3d27b6b31b26dc","description":"\n\t\n\t\t\n\t\n\t\n\t\tDataset Summary\n\t\n\nEntity Linking for 
low-resource African languages faces a major bottleneck: orthographic variation.\nThis gold standard dataset contains:\n\n1,045 sentences (filtered from MasakhaNER 2.0 validation and test splits to exclude sentences without entity mentions or with only DATE mentions).\n2,049 entity mentions categorized into LOC, ORG, and PER.\n1,704 linkable mentions (In-KB) resolved to 565 unique Wikidata QIDs.\n345 NIL mentions (Out-of-KB) representing entities… See the full description on the dataset page: https://huggingface.co/datasets/mbaye930/WolofEntityLinking.","downloads":60,"tags":["task_categories:token-classification","task_categories:text-retrieval","task_categories:text-generation","language:wo","license:cc-by-4.0","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","NER,","entity-linking","low-resource-nlp","wikidata","wolof"],"createdAt":"2026-06-21T20:38:45.000Z","key":""},{"_id":"6a385719b57803e6167e86f9","id":"MatrAIx2026/MatrAIx2026","author":"MatrAIx2026","disabled":false,"gated":"manual","lastModified":"2026-06-28T20:17:12.000Z","likes":3,"trendingScore":2,"private":false,"sha":"20e51a199d8749f78c86cb66d7db8f225c8e1849","description":"\n\t\n\t\t\n\t\n\t\n\t\tMatrAIx Persona Dataset\n\t\n\nTotal Personas: 184,770,129+ | Total Files: 28,222 | Sources: 16 datasets (13 integrated, 3 pending) | Status: ✅ Complete + ⏳ Integrating (2026-06-27)\nA comprehensive, multi-source persona dataset combining professional personas, synthetic personas, user-derived data, and large-scale behavioral data from multiple sources.\n\n\t\n\t\t\n\t\n\t\n\t\t📊 Complete Data Sources & Statistics\n\t\n\n\n\t\n\t\t\nSource\nType\nPersonas\nFiles\nFormat\nLocation\nStatus\nUploaded By… See the full description on the dataset page: 
https://huggingface.co/datasets/MatrAIx2026/MatrAIx2026.","downloads":3569,"tags":["region:us"],"createdAt":"2026-06-21T21:26:49.000Z","key":""},{"_id":"6a3865c5126105f6d4df7547","id":"Y-Research-Group/VisReason","author":"Y-Research-Group","disabled":false,"gated":false,"lastModified":"2026-06-21T22:35:41.000Z","likes":2,"trendingScore":2,"private":false,"sha":"e8f3e9bd0b3907290f8c4cbbfc3d87d8dedf8396","description":"\n\t\n\t\t\n\t\n\t\n\t\tVisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning\n\t\n\nVisReason is a large-scale dataset designed to advance visual Chain-of-Thought (CoT)\nreasoning in multimodal large language models (MLLMs). Rather than mapping an image\ndirectly to an answer, VisReason supervises a human-like, global-to-local reasoning\nprocess: the model first forms a holistic hypothesis about the scene, then iteratively\nzooms into salient regions (areas of interest) to collect fine-grained… See the full description on the dataset page: https://huggingface.co/datasets/Y-Research-Group/VisReason.","downloads":89,"tags":["task_categories:visual-question-answering","task_categories:image-text-to-text","language:en","license:other","size_categories:100K<n<1M","region:us","visual-chain-of-thought","visual-reasoning","multimodal","grounding","multi-round-reasoning"],"createdAt":"2026-06-21T22:29:25.000Z","key":""},{"_id":"6a38a668b22e71da5a849326","id":"agibot-world/GenieSim3.0-Dataset","author":"agibot-world","disabled":false,"gated":false,"lastModified":"2026-06-29T13:18:09.000Z","likes":2,"trendingScore":2,"private":false,"sha":"143f58e64c7b779290fc1b27f92ce6eff5ad40a2","downloads":361,"tags":["region:us"],"createdAt":"2026-06-22T03:05:12.000Z","key":""},{"_id":"6a391c0133fdd2c66ecafe08","id":"FudanCVL/FeVOS","author":"FudanCVL","disabled":false,"gated":false,"lastModified":"2026-06-23T08:20:51.000Z","likes":2,"trendingScore":2,"private":false,"sha":"09e17a888b79f75d2067ce419a5c1c96077babb7","description":"\n\t\n\t\t\n\t\n\t\n\t\tFeVO
S\n\t\n\nFeVOS is a video understanding dataset for future-oriented referring expression and object segmentation tasks. The dataset pairs video frame sequences with natural-language expressions, answers, and object/mask annotations.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Contents\n\t\n\nThe repository contains:\n\nJPEGImages.zip: compressed frame images. Extracting this archive creates a JPEGImages/ directory with per-video frame folders.\nmeta_expressions_train.json: training split metadata.… See the full description on the dataset page: https://huggingface.co/datasets/FudanCVL/FeVOS.","downloads":104,"tags":["language:en","size_categories:10K<n<100K","region:us","video-object-segmentation","referring-expression","visual-question-answering","video-understanding","segmentation"],"createdAt":"2026-06-22T11:26:57.000Z","key":""},{"_id":"6a392e54f06c2d16bd6b2d27","id":"denyser/RedditTurkey","author":"denyser","disabled":false,"gated":false,"lastModified":"2026-06-27T22:28:30.000Z","likes":2,"trendingScore":2,"private":false,"sha":"c9797ed631b1339cccc6686ebb3cb25d0a7a1552","description":"\n\t\n\t\t\n\t\n\t\n\t\t🧠 RedditTurkey Dataset\n\t\n\nEn az 100k sample a ulaştım çok büyük bir dataset ağı kurmak istiyorum gönüllü ekibe katılmak isteyenler eklesin ig:mehmetucarcok güncelleme:28.06.26\nLLANET OLSUN HARDDİSK ÇKTÜ VE VERİLER GİTTİ 2 GÜN İÇİNDE YAYINLAYACAĞIM\n\n\t\n\t\t\n\t\n\t\n\t\t📌 Overview\n\t\n\nRedditTurkey is a multilingual dataset collected from Reddit discussions across various subreddits, with a strong focus on Turkish communities and mixed English content.\nThe dataset contains user comments… See the full description on the dataset page: 
https://huggingface.co/datasets/denyser/RedditTurkey.","downloads":69,"tags":["language:tr","language:en","license:mit","region:us","reddit","turkish","english","comments","social-media","nlp","dataset","conversational-ai","scraping"],"createdAt":"2026-06-22T12:45:08.000Z","key":""},{"_id":"6a39377f8b119b745bf5809e","id":"SagiPolaczek/sync-lora","author":"SagiPolaczek","disabled":false,"gated":false,"lastModified":"2026-06-22T13:52:01.000Z","likes":2,"trendingScore":2,"private":false,"sha":"9436387a00a062291d15ed6e7d1dabdcd1973384","description":"\n\t\n\t\t\n\t\n\t\n\t\tsync-lora\n\t\n\nA reference-conditioned video-to-video dataset for training LTX-2 IC-LoRAs\n(In-Context LoRA). Each sample pairs a reference video (conditioning input)\nwith a target video (desired output) and a caption. 532 paired samples,\n480×480 source, 81 frames (intended for the 512×512×81 LTX-2 bucket).\nThis repo ships raw videos + metadata in the layout LTX-2's\nprocess_dataset.py expects. It does not include precomputed latents —\nthose are tied to a specific VAE / text-encoder… See the full description on the dataset page: 
https://huggingface.co/datasets/SagiPolaczek/sync-lora.","downloads":921,"tags":["task_categories:text-to-video","task_categories:video-to-video","license:other","size_categories:1K<n<10K","modality:video","library:datasets","library:mlcroissant","region:us","ltx-video","ltx-2","lora","ic-lora","reference-video","lip-sync"],"createdAt":"2026-06-22T13:24:15.000Z","key":""},{"_id":"6a3994da8113c164a4231703","id":"abdurrehman456/GLM_5.2_Training_Data","author":"abdurrehman456","disabled":false,"gated":false,"lastModified":"2026-06-25T10:46:16.000Z","likes":2,"trendingScore":2,"private":false,"sha":"bd497649876d08bd8847a8f58e85c572bbde9fcf","downloads":254,"tags":["size_categories:1M<n<10M","format:parquet","format:optimized-parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-06-22T20:02:34.000Z","key":""},{"_id":"6a3a0acc16a91ce3daa79002","id":"msw-ai-tf/maplestory-worlds-creator-docs","author":"msw-ai-tf","disabled":false,"gated":false,"lastModified":"2026-06-23T04:42:17.000Z","likes":2,"trendingScore":2,"private":false,"sha":"ac84b6e2e3ae05808062c2dcca59d0c1a8c8144a","description":"\n\t\n\t\t\n\t\n\t\n\t\tMapleStory Worlds Creator Center Documentation\n\t\n\nA curated dataset built from the official documentation of the\nMapleStory Worlds Creator Center.\nIt is a parallel Korean/English documentation corpus intended for RAG, search,\nembeddings, and domain language-model training.\nThe dataset covers all three Creator Center content types — guide documents\n(doc), API Reference (api), and resources (res).\n\n\t\n\t\t\n\t\n\t\n\t\tComposition\n\t\n\nDocument counts by type and language:\n\n\t\n\t\t\ntype\nDescription… See the full description on the dataset page: 
https://huggingface.co/datasets/msw-ai-tf/maplestory-worlds-creator-docs.","downloads":42,"tags":["task_categories:text-generation","task_categories:text-retrieval","multilinguality:multilingual","language:ko","language:en","license:cc-by-4.0","size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","maplestory","maplestory-worlds","game-development","documentation","lua","rag"],"createdAt":"2026-06-23T04:25:48.000Z","key":""},{"_id":"6a3a4c1f8177119f288c3a13","id":"PhysiQuanty/FRENCH-ONLY-Common-Crawl-2026-25","author":"PhysiQuanty","disabled":false,"gated":false,"lastModified":"2026-06-26T06:17:05.000Z","likes":2,"trendingScore":2,"private":false,"sha":"4312d864844946d113935a1f2a67aa5c28a54222","downloads":543,"tags":["license:apache-2.0","size_categories:1M<n<10M","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-06-23T09:04:31.000Z","key":""},{"_id":"6a3a6f09492b5e48b1ebff54","id":"HuggingFaceBio/vepqa","author":"HuggingFaceBio","disabled":false,"gated":false,"lastModified":"2026-06-25T22:42:25.000Z","likes":2,"trendingScore":2,"private":false,"sha":"da7f717e8f68172ec973b028071299821c299b68","description":"\n\t\n\t\t\n\t\n\t\n\t\tVEPQA\n\t\n\nVEPQA is a supervised fine-tuning dataset for variant-effect-prediction style question answering over DNA sequence context. 
It is built from raw ClinVar variant records and UCSC hg38 reference sequence windows, with deterministic gold labels from ClinVar.\n\n\t\n\t\t\n\t\n\t\n\t\tConfigs\n\t\n\n\nbinary: binary clinical-significance classification.\nclinvar_levels: finer-grained ClinVar clinical-significance classification.\npathogenic_pairwise: reference-vs-alternate sequence preference for… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceBio/vepqa.","downloads":168,"tags":["size_categories:10K<n<100K","format:json","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-06-23T11:33:29.000Z","key":""},{"_id":"6a3ae1ca4df11e17878f2ec1","id":"multimodalart/krea2-aoti-kernels","author":"multimodalart","disabled":false,"gated":false,"lastModified":"2026-06-23T19:43:13.000Z","likes":2,"trendingScore":2,"private":false,"sha":"e721af162afddda6ac23c553c7b469dfd4e6161a","downloads":63,"tags":["region:us"],"createdAt":"2026-06-23T19:43:06.000Z","key":""},{"_id":"6a3af86513b169635df00fe2","id":"TonicAI/Privacy-Bench","author":"TonicAI","disabled":false,"gated":false,"lastModified":"2026-06-26T18:22:32.000Z","likes":2,"trendingScore":2,"private":false,"sha":"eec8ed6761205be37ff724f3c50887b074149334","description":"\n\t\n\t\t\n\t\n\t\n\t\tPrivacyBench\n\t\n\n\nPrivacyBench is a benchmark for de-identifying semi-structured data exports from work tools like email, messaging, calendar, and so forth. The benchmark focuses on the identification and synthesis of PII in the unstructured text fields of the data export, and introduces novel metrics for evaluating synthesis quality. 
This initial version consists of data exports of Slack and email messages from 21 distinct personas generated by the Fabricate synthetic data tool… See the full description on the dataset page: https://huggingface.co/datasets/TonicAI/Privacy-Bench.","downloads":157,"tags":["task_categories:token-classification","task_categories:text-generation","language:en","license:cc-by-4.0","size_categories:10K<n<100K","region:us","pii","de-identification","data-synthesis","privacy","ner","named-entity-recognition","synthetic-data","email","slack","benchmark","llm-as-a-judge"],"createdAt":"2026-06-23T21:19:33.000Z","key":""},{"_id":"6a3afd70dce362467790d7e7","id":"Quaxicron/Fable-5-traces","author":"Quaxicron","disabled":false,"gated":false,"lastModified":"2026-06-23T21:41:05.000Z","likes":2,"trendingScore":2,"private":false,"sha":"35b9e1f8cc75030398fea687c92a13f5c61bbaf8","description":"A simple dataset of the raw Fable 5 Claude session logs we could get our hands on before it was taken away (no clue if it's coming back).\nThe raw trace files live in sessions/*.jsonl. 
Cache files, paste-cache files, shell history, and merged COT training exports are intentionally omitted so Hugging Face Datasets can load the repo through the agent-traces path.\n\n\t\n\t\t\n\t\n\t\n\t\tA pretty viewer for dataset:… See the full description on the dataset page: https://huggingface.co/datasets/Quaxicron/Fable-5-traces.","downloads":176,"tags":["license:agpl-3.0","size_categories:n<1K","format:json","format:agent-traces","modality:tabular","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us"],"createdAt":"2026-06-23T21:41:04.000Z","key":""},{"_id":"6a3bac09a2366d44ce8a4504","id":"tsinghua-ee/DVD-Bench","author":"tsinghua-ee","disabled":false,"gated":"auto","lastModified":"2026-06-24T10:56:52.000Z","likes":2,"trendingScore":2,"private":false,"sha":"45bf982a0e73f7d541ed1548ca69709daf67606e","description":"\n\t\n\t\t\n\t\n\t\n\t\tDVD-Bench\n\t\n\nA benchmark for Dialogue-centric Video Description, evaluating \"When, Who, and What is Said\" in dialogue-centric videos.\n\n\n\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tRepository layout\n\t\n\nDVD-Bench/\n├── data/\n│   └── en/\n│       └── test.parquet       # English test split annotations\n└── videos/\n    └── en/\n        └── *.mp4              # English test split videos\n\n\n\t\n\t\t\n\t\n\t\n\t\tAnnotation schema\n\t\n\nEach row in data/en/test.parquet contains:\n\n\t\n\t\t\nfield\ntype\ndescription\n\n\n\t\t\nvideo\nstring\nVideo… See the full description on the dataset page: 
https://huggingface.co/datasets/tsinghua-ee/DVD-Bench.","downloads":14,"tags":["task_categories:video-text-to-text","language:en","license:cc-by-nc-4.0","size_categories:n<1K","format:parquet","modality:text","modality:video","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2602.07960","region:us","video","dialogue","speaker","benchmark"],"createdAt":"2026-06-24T10:06:01.000Z","key":""},{"_id":"6a3ce8c686d6810dc6c388ff","id":"CSDDSFSFSAFSAF/Reflect-R1-data","author":"CSDDSFSFSAFSAF","disabled":false,"gated":false,"lastModified":"2026-06-29T03:49:09.000Z","likes":2,"trendingScore":2,"private":false,"sha":"62f47da07219c58e95e4df90c7ed9ba4f61b9d4b","description":"\n\t\n\t\t\n\t\n\t\n\t\tReflect-R1 Data\n\t\n\nPublic training data for Reflect-R1: Evidence-Driven Reflection for Self-Correction in Long Video Understanding.\n\nPaper: https://arxiv.org/abs/2606.27922\nCode: https://github.com/ShuimuChen-hyq/Reflect-R1\nModel: https://huggingface.co/CSDDSFSFSAFSAF/Reflect-R1\n\n\n\t\n\t\t\n\t\n\t\n\t\tFiles\n\t\n\ndata/reflect_r1_cot_90k.jsonl        Reflect-R1-CoT-90k cold-start SFT data\ndata/reflect_r1_rl_30k_short.json    Reflect-R1-RL-30k short-video split\ndata/reflect_r1_rl_30k_long.json… See the full description on the dataset page: 
https://huggingface.co/datasets/CSDDSFSFSAFSAF/Reflect-R1-data.","downloads":42,"tags":["task_categories:visual-question-answering","task_categories:video-classification","language:en","license:apache-2.0","size_categories:100K<n<1M","arxiv:2606.27922","region:us","long-video-understanding","video-question-answering","multimodal-reasoning","self-correction","reflection","reinforcement-learning","grpo"],"createdAt":"2026-06-25T08:37:26.000Z","key":""},{"_id":"6a3d844fecf2dd948ff49b27","id":"AhiskaAI/python-instruct-turkish","author":"AhiskaAI","disabled":false,"gated":false,"lastModified":"2026-06-25T19:42:08.000Z","likes":2,"trendingScore":2,"private":false,"sha":"477f3c448136febb515c53a1d2de79280d7ea137","description":"\n\t\n\t\t\n\t\n\t\n\t\tAhiskaAI Python Instruct Turkish Dataset\n\t\n\nAhiskaAI Python Instruct Turkish is a large-scale, high-quality instruction-following dataset containing 10,823 unique Python programming tasks, code solutions, and line-by-line logical explanations in Turkish. 
\nThis dataset is specifically tailored to enhance the Python code-generation, bug-fixing, and algorithmic reasoning capabilities of Small Language Models (SLMs) in the Turkish language ecosystem.\n\n\n\t\n\t\t\n\t\n\t\n\t\t📊 Dataset Details… See the full description on the dataset page: https://huggingface.co/datasets/AhiskaAI/python-instruct-turkish.","downloads":55,"tags":["task_categories:text-generation","language:tr","license:apache-2.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","python","code-generation","instruct","alpaca","synthetic","tr-code"],"createdAt":"2026-06-25T19:41:03.000Z","key":""},{"_id":"6a3da505d3556e2fa0afd324","id":"kauandivino/FakenewsBR","author":"kauandivino","disabled":false,"gated":false,"lastModified":"2026-06-25T22:28:41.000Z","likes":2,"trendingScore":2,"private":false,"sha":"e30191221ccc7c8c60cd6d4ab7d6f8f7617cd055","description":"\n\t\n\t\t\n\t\n\t\n\t\tFakenewsBR\n\t\n\nFakenewsBR is a Brazilian Portuguese dataset for misinformation and fake news research. 
It aggregates texts from multiple Brazilian sources, including news articles, tweets, WhatsApp messages, and political/news-related content, and provides a normalized binary veracity label for each record.\nRepository: [https://github.com/AKCIT-FN/fakenews-data]\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Details\n\t\n\n\n\t\n\t\t\nField\nValue\n\n\n\t\t\nDataset name\nFakenewsBR Factchecked\n\n\nLanguage\nBrazilian Portuguese… See the full description on the dataset page: https://huggingface.co/datasets/kauandivino/FakenewsBR.","downloads":48,"tags":["language:pt","size_categories:10K<n<100K","region:us","fake-news","misinformation","fact-checking"],"createdAt":"2026-06-25T22:00:37.000Z","key":""},{"_id":"6a3e89a645bb5fc9e75248e2","id":"Sophelio/fusion-equilibrium-challenge","author":"Sophelio","disabled":false,"gated":false,"lastModified":"2026-06-26T15:20:47.000Z","likes":2,"trendingScore":2,"private":false,"sha":"552c7792a1213b2a7073428e659c944ca93a16ba","description":"\n\t\n\t\t\n\t\n\t\n\t\t⚛️ The Fusion Equilibrium Challenge\n\t\n\nPredict the shape of a fusion plasma (the magnetic equilibrium, $\\psi$) from\ncontrol inputs and diagnostics alone — without magnetic sensors. 
Data comes from\ntwo tokamaks:\n\nDIII-D — General Atomics tokamak (San Diego, USA)\nMAST — Mega Ampere Spherical Tokamak (Culham, UK)\n\nIn data-science terms this is an image-regression / control problem: predict a\n2-D poloidal flux map (efit_psirz) at each EFIT timestep from coil currents\n(the actuators)… See the full description on the dataset page: https://huggingface.co/datasets/Sophelio/fusion-equilibrium-challenge.","downloads":2016,"tags":["language:en","license:other","size_categories:1K<n<10K","format:parquet","modality:text","modality:timeseries","library:datasets","library:dask","library:polars","library:mlcroissant","region:us","fusion","plasma-physics","tokamak","scientific-ml","equilibrium-reconstruction","image-regression"],"createdAt":"2026-06-26T14:16:06.000Z","key":""},{"_id":"6a3ebde7c7ffe8ba769038f1","id":"sKT-Ai-Labs/Q-Augmented","author":"sKT-Ai-Labs","disabled":false,"gated":false,"lastModified":"2026-06-27T15:59:38.000Z","likes":2,"trendingScore":2,"private":false,"sha":"ee25129cf524b80ff2a0884f90894b041f103cf7","description":"\n\t\n\t\t\n\t\n\t\n\t\t❤️ Support Our Mission\n\t\n\n\n  🙏 Drop a Heart ❤ For Our Hard Work!\n  \n    If you believe in the vision of Sovereign Indian AI, please show your support by dropping a heart below. 
Your encouragement fuels our journey!\n  \n  \n  \n    \n      ❤️ Like This Dataset\n    \n    \n      ➕ Follow SKT AI LABS    \n  \n  \n  \n    Kindly follow us for more updates and contribute to our open-source journey!\n  \n\n\n\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tSKT DATA AUGMENTATION SUITE\n\t\n\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tQ-AUGMENTED… See the full description on the dataset page: https://huggingface.co/datasets/sKT-Ai-Labs/Q-Augmented.","downloads":64,"tags":["language:en","license:other","region:us","dataset","question-pairs","data-augmentation","synthetic-data","llm-training","evaluation","skt-ai-labs","nrs"],"createdAt":"2026-06-26T17:59:03.000Z","key":""},{"_id":"6a3ec9a59e351e09ade1f9c6","id":"vcl-iisc/DynEval-dataset","author":"vcl-iisc","disabled":false,"gated":"manual","lastModified":"2026-06-27T17:26:25.000Z","likes":2,"trendingScore":2,"private":false,"sha":"3e8cee83cf7517d8092eda6749fcc21b884c467b","description":"\n\t\n\t\t\n\t\n\t\n\t\tDYNEVAL-1K\n\t\n\nDYNEVAL-1K is a 1,036-prompt image generation evaluation set with generated images from 36 text-to-image models and model-response evaluation JSONs for the first 1,000 prompts.\nThe dataset is organized around prompt IDs and zero-padded image filenames. 
For prompt 0001, each model image is stored as DYNEVAL-1K-IMAGES/<ModelName>/0001.png, and the corresponding evaluation response is stored as model-responses/answers/0001.json.\n\n\t\n\t\t\n\t\n\t\n\t\tContents\n\t\n\nDYNEVAL-1K/\n├──… See the full description on the dataset page: https://huggingface.co/datasets/vcl-iisc/DynEval-dataset.","downloads":22,"tags":["region:us"],"createdAt":"2026-06-26T18:49:09.000Z","key":""},{"_id":"6a3eda828b6665612483a2c5","id":"Tamazight/Tifinagh-OCR-200k","author":"Tamazight","disabled":false,"gated":false,"lastModified":"2026-06-29T23:34:24.000Z","likes":2,"trendingScore":2,"private":false,"sha":"75f24049905380be85a04dd7ab83dda4ff74afd4","description":"\n  \n    \n      \n      @Tamazight\n      Dataset\n    \n    \n      Tifinagh OCR 200k\n    \n    \n      Comprehensive synthetic dataset for Tifinagh script recognition.\n    \n  \n  \n    \n      200K\n      Samples\n    \n    \n      2D30-2D7F\n      Unicode range\n    \n    \n      Apache 2.0\n      License\n    \n  \n\n\n\n  ☀️ Overview\n  Tifinagh OCR 200k is a high-fidelity synthetic dataset meticulously engineered to push the boundaries of OCR performance for the Tifinagh script. 
With 200,000 unique samples, this… See the full description on the dataset page: https://huggingface.co/datasets/Tamazight/Tifinagh-OCR-200k.","downloads":4091,"tags":["task_categories:image-to-text","task_categories:image-text-to-text","task_categories:image-feature-extraction","language:zgh","language:tzm","language:ber","language:shi","language:gha","language:shy","language:mzb","license:apache-2.0","size_categories:100K<n<1M","format:imagefolder","modality:image","modality:text","library:datasets","library:mlcroissant","region:us","ocr","tifinagh","tamazight","document-understanding","vision-language","synthetic-data"],"createdAt":"2026-06-26T20:01:06.000Z","key":""},{"_id":"6a3eecad0571791d91e5977c","id":"sKT-Ai-Labs/HQ-QUESTION-PAIRS","author":"sKT-Ai-Labs","disabled":false,"gated":false,"lastModified":"2026-06-27T16:00:38.000Z","likes":2,"trendingScore":2,"private":false,"sha":"aef00c73d135ed2002cdf3053d21b78f99f108d7","description":"\n\t\n\t\t\n\t\n\t\n\t\t❤️ Support Our Mission\n\t\n\n\n  🙏 Drop a Heart ❤ For Our Hard Work!\n  \n    If you believe in the vision of Sovereign Indian AI, please show your support by dropping a heart below. 
Your encouragement fuels our journey!\n  \n  \n  \n    \n      ❤️ Like This Dataset\n    \n    \n      ➕ Follow SKT AI LABS    \n  \n  \n  \n    Kindly follow us for more updates and contribute to our open-source journey!\n  \n\n\n\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tHQ PAIRS DATASET DIVISION\n\t\n\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tSKT HIGH-QUALITY QUESTION PAIRS… See the full description on the dataset page: https://huggingface.co/datasets/sKT-Ai-Labs/HQ-QUESTION-PAIRS.","downloads":66,"tags":["license:apache-2.0","size_categories:1M<n<10M","format:text","modality:image","modality:text","library:datasets","library:mlcroissant","region:us","dataset","evaluation","reasoning","coding","system-prompt","llm-benchmark","skt-ai-labs","nrs"],"createdAt":"2026-06-26T21:18:37.000Z","key":""},{"_id":"6a40e6fa974497b9b303134a","id":"fox3000foxy/valorant-ljspeech-piper","author":"fox3000foxy","disabled":false,"gated":false,"lastModified":"2026-06-28T15:46:24.000Z","likes":2,"trendingScore":2,"private":false,"sha":"153543640b16cf4f743d5adfe765778adf857ba7","downloads":49,"tags":["region:us"],"createdAt":"2026-06-28T09:18:50.000Z","key":""},{"_id":"6a41bc875e4f9b06d4060a9e","id":"msw-ai-tf/maplestory-worlds-creator-qa","author":"msw-ai-tf","disabled":false,"gated":false,"lastModified":"2026-06-29T00:30:15.000Z","likes":2,"trendingScore":2,"private":false,"sha":"03cef9581cc5899f3ccf823456f6634d395b2844","description":"\n\t\n\t\t\n\t\n\t\n\t\tMapleStory Worlds Creator QA\n\t\n\nSynthetic question-answer dataset built from the official\nMapleStory Worlds Creator Center\ndocumentation. Questions are generated to be self-contained and grounded in the\nsource docs; answers avoid source/meta references so they read like an expert\nexplanation. Some QA pairs are composed from multiple related documents\n(see combo_sources).\nParallel Korean/English. 
Intended for instruction tuning, QA, and retrieval.\n\n\t\n\t\t\n\t\n\t\n\t\tComposition… See the full description on the dataset page: https://huggingface.co/datasets/msw-ai-tf/maplestory-worlds-creator-qa.","downloads":7,"tags":["task_categories:question-answering","task_categories:text-generation","multilinguality:multilingual","language:ko","language:en","license:cc-by-4.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","maplestory","maplestory-worlds","game-development","qa","synthetic","instruction"],"createdAt":"2026-06-29T00:29:59.000Z","key":""},{"_id":"6a421708a60d89da199264c9","id":"msw-ai-tf/maplestory-worlds-creator-code-instruct","author":"msw-ai-tf","disabled":false,"gated":false,"lastModified":"2026-06-29T06:56:13.000Z","likes":2,"trendingScore":2,"private":false,"sha":"c3d428a08a1f2cd90b2e55be4b639a26345c5b47","description":"\n\t\n\t\t\n\t\n\t\n\t\tMapleStory Worlds Creator Code (mlua)\n\t\n\nInstruction-style code dataset for mlua, the scripting language of\nMapleStory Worlds. Built from the\nofficial Creator Center example code: each example is grounded in its source\ndocument and paired with a natural-language task, reasoning, a self-contained\nexplanation, and commented mlua code. 
Intended to teach LLMs to write mlua game\nscripts.\nThe example code is preserved from the official source (a code-preservation check\nrejects any record… See the full description on the dataset page: https://huggingface.co/datasets/msw-ai-tf/maplestory-worlds-creator-code-instruct.","downloads":0,"tags":["task_categories:text-generation","multilinguality:multilingual","language:ko","language:en","license:cc-by-4.0","size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","maplestory","maplestory-worlds","game-development","code","mlua","lua","instruction"],"createdAt":"2026-06-29T06:56:08.000Z","key":""},{"_id":"621ffdd236468d709f181d5a","id":"UCLNLP/adversarial_qa","author":"UCLNLP","disabled":false,"gated":false,"lastModified":"2023-12-21T14:20:00.000Z","likes":44,"trendingScore":1,"private":false,"sha":"c2d5f738db1ad21a4126a144dfbb00cb51e0a4a9","description":"\n\t\n\t\t\n\t\tDataset Card for adversarialQA\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nWe have created three new Reading Comprehension datasets constructed using an adversarial model-in-the-loop.\nWe use three different models; BiDAF (Seo et al., 2016), BERTLarge (Devlin et al., 2018), and RoBERTaLarge (Liu et al., 2019) in the annotation loop and construct three datasets; D(BiDAF), D(BERT), and D(RoBERTa), each with 10,000 training examples, 1,000 validation, and 1,000 test examples.\nThe adversarial human… See the full description on the dataset page: 
https://huggingface.co/datasets/UCLNLP/adversarial_qa.","downloads":6190,"paperswithcode_id":"adversarialqa","tags":["task_categories:question-answering","task_ids:extractive-qa","task_ids:open-domain-qa","annotations_creators:crowdsourced","language_creators:found","multilinguality:monolingual","source_datasets:original","language:en","license:cc-by-sa-4.0","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2002.00293","arxiv:1606.05250","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f181d5e","id":"allenai/ai2_arc","author":"allenai","disabled":false,"gated":false,"lastModified":"2023-12-21T15:09:48.000Z","likes":359,"trendingScore":1,"private":false,"sha":"210d026faf9955653af8916fad021475a3f00453","description":"\n\t\n\t\t\n\t\tDataset Card for \"ai2_arc\"\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nA new dataset of 7,787 genuine grade-school level, multiple-choice science questions, assembled to encourage research in\n advanced question-answering. The dataset is partitioned into a Challenge Set and an Easy Set, where the former contains\n only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. 
We are also\n including a corpus of over 14 million science sentences relevant to… See the full description on the dataset page: https://huggingface.co/datasets/allenai/ai2_arc.","downloads":416663,"tags":["task_categories:question-answering","task_ids:open-domain-qa","task_ids:multiple-choice-qa","annotations_creators:found","language_creators:found","multilinguality:monolingual","source_datasets:original","language:en","license:cc-by-sa-4.0","size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:1803.05457","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f181d91","id":"ParlAI/blended_skill_talk","author":"ParlAI","disabled":false,"gated":false,"lastModified":"2024-01-10T10:22:26.000Z","likes":75,"trendingScore":1,"private":false,"sha":"d7b0093243439fa5f0cd9663125cc47575ced2ea","description":"\n\t\n\t\t\n\t\tDataset Card for \"blended_skill_talk\"\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nA dataset of 7k conversations explicitly designed to exhibit multiple conversation modes: displaying personality, having empathy, and demonstrating knowledge.\n\n\t\n\t\t\n\t\tSupported Tasks and Leaderboards\n\t\n\nMore Information Needed\n\n\t\n\t\t\n\t\tLanguages\n\t\n\nMore Information Needed\n\n\t\n\t\t\n\t\tDataset Structure\n\t\n\n\n\t\n\t\t\n\t\tData Instances\n\t\n\n\n\t\n\t\t\n\t\tdefault\n\t\n\n\nSize of downloaded dataset files: 38.11 MB\nSize of the generated dataset:… See the full description on the dataset page: 
https://huggingface.co/datasets/ParlAI/blended_skill_talk.","downloads":6583,"paperswithcode_id":"blended-skill-talk","tags":["task_ids:dialogue-generation","annotations_creators:crowdsourced","language_creators:crowdsourced","multilinguality:monolingual","source_datasets:original","language:en","license:unknown","size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2004.08449","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f181d96","id":"bookcorpus/bookcorpus","author":"bookcorpus","disabled":false,"gated":false,"lastModified":"2024-05-03T13:48:33.000Z","likes":357,"trendingScore":1,"private":false,"sha":"d917559bbe9cf49c638fc331c37c4bf239e3b637","citation":"@InProceedings{Zhu_2015_ICCV,\n    title = {Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books},\n    author = {Zhu, Yukun and Kiros, Ryan and Zemel, Rich and Salakhutdinov, Ruslan and Urtasun, Raquel and Torralba, Antonio and Fidler, Sanja},\n    booktitle = {The IEEE International Conference on Computer Vision (ICCV)},\n    month = {December},\n    year = {2015}\n}","description":"Books are a rich source of both fine-grained information, how a character, an object or a scene looks like, as well as high-level semantics, what someone is thinking, feeling and how these states evolve through a story.This work aims to align books to their movie releases in order to providerich descriptive explanations for visual content that go semantically farbeyond the captions available in current datasets. 
\\","downloads":9603,"paperswithcode_id":"bookcorpus","tags":["task_categories:text-generation","task_categories:fill-mask","task_ids:language-modeling","task_ids:masked-language-modeling","annotations_creators:no-annotation","language_creators:found","multilinguality:monolingual","source_datasets:original","language:en","license:unknown","size_categories:10M<n<100M","arxiv:2105.05241","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f181d9b","id":"UFRGS/brwac","author":"UFRGS","disabled":false,"gated":false,"lastModified":"2024-01-18T11:02:06.000Z","likes":25,"trendingScore":1,"private":false,"sha":"3475bc217e5241f9a5c833b2f8ae9b74a2d7e44d","citation":"@inproceedings{wagner2018brwac,\n  title={The brwac corpus: A new open resource for brazilian portuguese},\n  author={Wagner Filho, Jorge A and Wilkens, Rodrigo and Idiart, Marco and Villavicencio, Aline},\n  booktitle={Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},\n  year={2018}\n}","description":"The BrWaC (Brazilian Portuguese Web as Corpus) is a large corpus constructed following the Wacky framework,\nwhich was made public for research purposes. The current corpus version, released in January 2017, is composed by\n3.53 million documents, 2.68 billion tokens and 5.79 million types. 
Please note that this resource is available\nsolely for academic research purposes, and you agreed not to use it for any commercial applications.\nManually download at https://www.inf.ufrgs.br/pln/wiki/index.php?title=BrWaC","downloads":159,"paperswithcode_id":"brwac","tags":["task_categories:text-generation","task_categories:fill-mask","task_ids:language-modeling","task_ids:masked-language-modeling","annotations_creators:no-annotation","language_creators:found","multilinguality:monolingual","source_datasets:original","language:pt","license:unknown","size_categories:1M<n<10M","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f181db0","id":"uoft-cs/cifar10","author":"uoft-cs","disabled":false,"gated":false,"lastModified":"2024-01-04T06:53:11.000Z","likes":108,"trendingScore":1,"private":false,"sha":"0b2714987fa478483af9968de7c934580d0bb9a2","description":"\n\t\n\t\t\n\t\tDataset Card for CIFAR-10\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThe CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.\nThe dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. 
The training batches contain the remaining images in random order, but some training batches may contain… See the full description on the dataset page: https://huggingface.co/datasets/uoft-cs/cifar10.","downloads":175999,"paperswithcode_id":"cifar-10","tags":["task_categories:image-classification","annotations_creators:crowdsourced","language_creators:found","multilinguality:monolingual","source_datasets:extended|other-80-Million-Tiny-Images","language:en","license:unknown","size_categories:10K<n<100K","format:parquet","modality:image","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f181dbe","id":"code-search-net/code_search_net","author":"code-search-net","disabled":false,"gated":false,"lastModified":"2026-02-23T14:11:41.000Z","likes":332,"trendingScore":1,"private":false,"sha":"bd0cf261e357a3eb5c8fba490d23ec1a1cd59555","description":"\n\t\n\t\t\n\t\tDataset Card for CodeSearchNet corpus\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nCodeSearchNet corpus is a dataset of 2 milllion (comment, code) pairs from opensource libraries hosted on GitHub. 
It contains code and documentation for several programming languages.\nCodeSearchNet corpus was gathered to support the CodeSearchNet challenge, to explore the problem of code retrieval using natural language.\n\n\t\n\t\t\n\t\n\t\n\t\tSupported Tasks and Leaderboards\n\t\n\n\nlanguage-modeling: The dataset can be used to… See the full description on the dataset page: https://huggingface.co/datasets/code-search-net/code_search_net.","downloads":22474,"paperswithcode_id":"codesearchnet","tags":["task_categories:text-generation","task_categories:fill-mask","task_ids:language-modeling","task_ids:masked-language-modeling","annotations_creators:no-annotation","language_creators:machine-generated","multilinguality:multilingual","source_datasets:original","language:code","license:other","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:1909.09436","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f181de7","id":"facebook/covost2","author":"facebook","disabled":false,"gated":false,"lastModified":"2024-01-18T11:02:25.000Z","likes":46,"trendingScore":1,"private":false,"sha":"369b47c4c20aff1193b8edeeedc37d14ae28226b","citation":"@misc{wang2020covost,\n    title={CoVoST 2: A Massively Multilingual Speech-to-Text Translation Corpus},\n    author={Changhan Wang and Anne Wu and Juan Pino},\n    year={2020},\n    eprint={2007.10310},\n    archivePrefix={arXiv},\n    primaryClass={cs.CL}","description":"CoVoST 2, a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages. The dataset is created using Mozilla’s open source Common Voice database of crowdsourced voice recordings.\n\nNote that in order to limit the required storage for preparing this dataset, the audio\nis stored in the .mp3 format and is not converted to a float32 array. 
To convert, the audio\nfile to a float32 array, please make use of the `.map()` function as follows:\n\n\n```python\nimport torchaudio\n\ndef map_to_array(batch):\n    speech_array, _ = torchaudio.load(batch[\"file\"])\n    batch[\"speech\"] = speech_array.numpy()\n    return batch\n\ndataset = dataset.map(map_to_array, remove_columns=[\"file\"])\n```","downloads":522,"tags":["task_categories:automatic-speech-recognition","annotations_creators:expert-generated","language_creators:crowdsourced","language_creators:expert-generated","multilinguality:multilingual","source_datasets:extended|other-common-voice","language:ar","language:ca","language:cy","language:de","language:es","language:et","language:fa","language:fr","language:id","language:it","language:ja","language:lv","language:mn","language:nl","language:pt","language:ru","language:sl","language:sv","language:ta","language:tr","language:zh","license:cc-by-nc-4.0","size_categories:100K<n<1M","arxiv:2007.10310","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f181e06","id":"ucinlp/drop","author":"ucinlp","disabled":false,"gated":false,"lastModified":"2024-01-17T08:15:43.000Z","likes":69,"trendingScore":1,"private":false,"sha":"95cda593fae71b60b5b19f82de3fcf3298c1239c","description":"\n\t\n\t\t\n\t\tDataset Card for \"drop\"\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nDROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs.\n. DROP is a crowdsourced, adversarially-created, 96k-question benchmark, in which a system must resolve references in a\nquestion, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or\n sorting). 
These operations require a much more comprehensive understanding of the content of paragraphs than… See the full description on the dataset page: https://huggingface.co/datasets/ucinlp/drop.","downloads":7625,"paperswithcode_id":"drop","tags":["task_categories:question-answering","task_ids:extractive-qa","task_ids:abstractive-qa","annotations_creators:crowdsourced","language_creators:crowdsourced","multilinguality:monolingual","source_datasets:original","language:en","license:cc-by-sa-4.0","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:1903.00161","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f181e2e","id":"takala/financial_phrasebank","author":"takala","disabled":false,"gated":false,"lastModified":"2025-12-15T05:51:57.000Z","likes":263,"trendingScore":1,"private":false,"sha":"8d3fe0c36d5feec6b3cc5e455b0fcb4820fb9964","citation":"@article{Malo2014GoodDO,\n  title={Good debt or bad debt: Detecting semantic orientations in economic texts},\n  author={P. Malo and A. Sinha and P. Korhonen and J. Wallenius and P. Takala},\n  journal={Journal of the Association for Information Science and Technology},\n  year={2014},\n  volume={65}\n}","description":"The key arguments for the low utilization of statistical techniques in\nfinancial sentiment analysis have been the difficulty of implementation for\npractical applications and the lack of high quality training data for building\nsuch models. Especially in the case of finance and economic texts, annotated\ncollections are a scarce resource and many are reserved for proprietary use\nonly. 
To resolve the missing training data problem, we present a collection of\n∼ 5000 sentences to establish human-annotated standards for benchmarking\nalternative modeling techniques.\n\nThe objective of the phrase level annotation task was to classify each example\nsentence into a positive, negative or neutral category by considering only the\ninformation explicitly available in the given sentence. Since the study is\nfocused only on financial and economic domains, the annotators were asked to\nconsider the sentences from the view point of an investor only; i.e. whether\nthe news may have positive, negative or neutral influence on the stock price.\nAs a result, sentences which have a sentiment that is not relevant from an\neconomic or financial perspective are considered neutral.\n\nThis release of the financial phrase bank covers a collection of 4840\nsentences. The selected collection of phrases was annotated by 16 people with\nadequate background knowledge on financial markets. Three of the annotators\nwere researchers and the remaining 13 annotators were master’s students at\nAalto University School of Business with majors primarily in finance,\naccounting, and economics.\n\nGiven the large number of overlapping annotations (5 to 8 annotations per\nsentence), there are several ways to define a majority vote based gold\nstandard. 
To provide an objective comparison, we have formed 4 alternative\nreference datasets based on the strength of majority agreement: all annotators\nagree, >=75% of annotators agree, >=66% of annotators agree and >=50% of\nannotators agree.","downloads":7147,"tags":["task_categories:text-classification","task_ids:multi-class-classification","task_ids:sentiment-classification","annotations_creators:expert-generated","language_creators:found","multilinguality:monolingual","source_datasets:original","language:en","license:cc-by-nc-sa-3.0","size_categories:1K<n<10K","arxiv:1307.5336","region:us","finance"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f181e65","id":"hotpotqa/hotpot_qa","author":"hotpotqa","disabled":false,"gated":false,"lastModified":"2025-08-11T10:16:27.000Z","likes":307,"trendingScore":1,"private":false,"sha":"1908d6afbbead072334abe2965f91bd2709910ab","description":"\n\t\n\t\t\n\t\tDataset Card for \"hotpot_qa\"\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nHotpotQA is a new dataset with 113k  Wikipedia-based question-answer  pairs with  four  key  features:  (1)  the  questions  require finding and reasoning over multiple supporting  documents  to  answer;  (2)  the  questions  are  diverse  and  not  constrained  to  any pre-existing  knowledge  bases  or  knowledge schemas;  (3)  we  provide  sentence-level  supporting facts required for reasoning, allowingQA systems to reason… See the full description on the dataset page: 
https://huggingface.co/datasets/hotpotqa/hotpot_qa.","downloads":79690,"paperswithcode_id":"hotpotqa","tags":["task_categories:question-answering","annotations_creators:crowdsourced","language_creators:found","multilinguality:monolingual","source_datasets:original","language:en","license:cc-by-sa-4.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:1809.09600","region:us","multi-hop"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f181e77","id":"stanfordnlp/imdb","author":"stanfordnlp","disabled":false,"gated":false,"lastModified":"2024-01-04T12:09:45.000Z","likes":388,"trendingScore":1,"private":false,"sha":"e6281661ce1c48d982bc483cf8a173c1bbeb5d31","description":"\n\t\n\t\t\n\t\tDataset Card for \"imdb\"\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nLarge Movie Review Dataset.\nThis is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. 
There is additional unlabeled data for use as well.\n\n\t\n\t\t\n\t\tSupported Tasks and Leaderboards\n\t\n\nMore Information Needed\n\n\t\n\t\t\n\t\tLanguages\n\t\n\nMore Information Needed\n\n\t\n\t\t\n\t\tDataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.","downloads":184568,"paperswithcode_id":"imdb-movie-reviews","tags":["task_categories:text-classification","task_ids:sentiment-classification","annotations_creators:expert-generated","language_creators:expert-generated","multilinguality:monolingual","source_datasets:original","language:en","license:other","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f181e87","id":"google/jigsaw_unintended_bias","author":"google","disabled":false,"gated":false,"lastModified":"2024-01-18T11:06:57.000Z","likes":8,"trendingScore":1,"private":false,"sha":"d46022c9df7c8b74ba52876f314f81ef60fa1727","description":"A collection of comments from the defunct Civil Comments platform that have been annotated for their toxicity.","downloads":744,"tags":["task_categories:text-classification","task_ids:text-scoring","annotations_creators:crowdsourced","language_creators:crowdsourced","multilinguality:monolingual","source_datasets:original","language:en","license:cc0-1.0","size_categories:1M<n<10M","region:us","toxicity-prediction"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f181ee2","id":"deepmind/narrativeqa","author":"deepmind","disabled":false,"gated":false,"lastModified":"2024-03-06T07:33:05.000Z","likes":66,"trendingScore":1,"private":false,"sha":"2e643e7363944af1c33a652d1c87320d0871c4e4","description":"\n\t\n\t\t\n\t\tDataset Card for Narrative QA\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nNarrativeQA is an English-lanaguage dataset of stories and corresponding 
questions designed to test reading comprehension, especially on long documents.\n\n\t\n\t\t\n\t\tSupported Tasks and Leaderboards\n\t\n\nThe dataset is used to test reading comprehension. There are 2 tasks proposed in the paper: \"summaries only\" and \"stories only\", depending on whether the human-generated summary or the full story text is used to answer the question.… See the full description on the dataset page: https://huggingface.co/datasets/deepmind/narrativeqa.","downloads":10958,"paperswithcode_id":"narrativeqa","tags":["task_ids:abstractive-qa","annotations_creators:crowdsourced","language_creators:found","multilinguality:monolingual","source_datasets:original","language:en","license:apache-2.0","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:1712.07040","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f181f09","id":"Skylion007/openwebtext","author":"Skylion007","disabled":false,"gated":false,"lastModified":"2025-12-26T16:47:16.000Z","likes":523,"trendingScore":1,"private":false,"sha":"b4325f019c648b1641a1784748667e8b74e5e064","description":"\n\t\n\t\t\n\t\n\t\n\t\tDataset Card for \"openwebtext\"\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Summary\n\t\n\nAn open-source replication of the WebText dataset from OpenAI, that was used to train GPT-2.\nThis distribution was created by Aaron Gokaslan and Vanya Cohen of Brown University.\n\n\t\n\t\t\n\t\n\t\n\t\tSupported Tasks and Leaderboards\n\t\n\nMore Information Needed\n\n\t\n\t\t\n\t\n\t\n\t\tLanguages\n\t\n\nMore Information Needed\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Structure\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tData Instances\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tplain_text\n\t\n\n\nSize of downloaded dataset… See the full description on the dataset page: 
https://huggingface.co/datasets/Skylion007/openwebtext.","downloads":70964,"paperswithcode_id":"openwebtext","tags":["task_categories:text-generation","task_categories:fill-mask","task_ids:language-modeling","task_ids:masked-language-modeling","annotations_creators:no-annotation","language_creators:found","multilinguality:monolingual","source_datasets:original","language:en","license:cc0-1.0","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f181f36","id":"rmyeid/polyglot_ner","author":"rmyeid","disabled":false,"gated":false,"lastModified":"2024-01-18T11:13:26.000Z","likes":40,"trendingScore":1,"private":false,"sha":"1c4ec77a49b25d59198e9c80b652925e50163379","citation":"@article{polyglotner,\n         author = {Al-Rfou, Rami and Kulkarni, Vivek and Perozzi, Bryan and Skiena, Steven},\n         title = {{Polyglot-NER}: Massive Multilingual Named Entity Recognition},\n         journal = {{Proceedings of the 2015 {SIAM} International Conference on Data Mining, Vancouver, British Columbia, Canada, April 30- May 2, 2015}},\n         month     = {April},\n         year      = {2015},\n         publisher = {SIAM},\n}","description":"Polyglot-NER\nA training dataset automatically generated from Wikipedia and Freebase the task\nof named entity recognition. The dataset contains the basic Wikipedia based\ntraining data for 40 languages we have (with coreference resolution) for the task of\nnamed entity recognition. The details of the procedure of generating them is outlined in\nSection 3 of the paper (https://arxiv.org/abs/1410.3791). Each config contains the data\ncorresponding to a different language. 
For example, \"es\" includes only spanish examples.","downloads":542,"paperswithcode_id":"polyglot-ner","tags":["task_categories:token-classification","task_ids:named-entity-recognition","annotations_creators:machine-generated","language_creators:found","multilinguality:multilingual","source_datasets:original","language:ar","language:bg","language:ca","language:cs","language:da","language:de","language:el","language:en","language:es","language:et","language:fa","language:fi","language:fr","language:he","language:hi","language:hr","language:hu","language:id","language:it","language:ja","language:ko","language:lt","language:lv","language:ms","language:nl","language:no","language:pl","language:pt","language:ro","language:ru","language:sk","language:sl","language:sr","language:sv","language:th","language:tl","language:tr","language:uk","language:vi","language:zh","license:unknown","arxiv:1410.3791","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f181f3c","id":"ncbi/pubmed","author":"ncbi","disabled":false,"gated":false,"lastModified":"2024-01-26T17:52:23.000Z","likes":165,"trendingScore":1,"private":false,"sha":"8263182bec2a83be3f73706759adada1e1bfe3c6","citation":"Courtesy of the U.S. National Library of Medicine.","description":"NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. 
See our documentation page for more information.","downloads":1105,"paperswithcode_id":"pubmed","tags":["task_categories:text-generation","task_categories:fill-mask","task_categories:text-classification","task_ids:language-modeling","task_ids:masked-language-modeling","task_ids:text-scoring","task_ids:topic-classification","annotations_creators:crowdsourced","language_creators:crowdsourced","multilinguality:monolingual","source_datasets:original","language:en","license:other","size_categories:10M<n<100M","region:us","citation-estimation"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f181f3d","id":"qiaojin/PubMedQA","author":"qiaojin","disabled":false,"gated":false,"lastModified":"2024-03-06T01:50:16.000Z","likes":327,"trendingScore":1,"private":false,"sha":"9001f2853fb87cab8d220904e0de81ac6973b318","description":"\n\t\n\t\t\n\t\tDataset Card for [Dataset Name]\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThe task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?) using the corresponding abstracts.\n\n\t\n\t\t\n\t\tSupported Tasks and Leaderboards\n\t\n\nThe official leaderboard is available at: https://pubmedqa.github.io/.\n500 questions in the pqa_labeled are used as the test set. 
They can be found at… See the full description on the dataset page: https://huggingface.co/datasets/qiaojin/PubMedQA.","downloads":26863,"paperswithcode_id":"pubmedqa","tags":["task_categories:question-answering","task_ids:multiple-choice-qa","annotations_creators:expert-generated","annotations_creators:machine-generated","language_creators:expert-generated","multilinguality:monolingual","source_datasets:original","language:en","license:mit","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:1909.06146","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f181f89","id":"stanfordnlp/snli","author":"stanfordnlp","disabled":false,"gated":false,"lastModified":"2024-03-06T10:55:50.000Z","likes":94,"trendingScore":1,"private":false,"sha":"cdb5c3d5eed6ead6e5a341c8e56e669bb666725b","description":"\n\t\n\t\t\n\t\tDataset Card for SNLI\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThe SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE).\n\n\t\n\t\t\n\t\tSupported Tasks and Leaderboards\n\t\n\nNatural Language Inference (NLI), also known as Recognizing Textual Entailment (RTE), is the… See the full description on the dataset page: 
https://huggingface.co/datasets/stanfordnlp/snli.","downloads":22637,"paperswithcode_id":"snli","tags":["task_categories:text-classification","task_ids:natural-language-inference","task_ids:multi-input-text-classification","annotations_creators:crowdsourced","language_creators:crowdsourced","multilinguality:monolingual","source_datasets:extended|other-flicker-30k","source_datasets:extended|other-visual-genome","language:en","license:cc-by-sa-4.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:1508.05326","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f181ff2","id":"AILAB-VNUHCM/vivos","author":"AILAB-VNUHCM","disabled":false,"gated":false,"lastModified":"2023-06-14T08:29:21.000Z","likes":18,"trendingScore":1,"private":false,"sha":"3cbfb2502e5e84776b4b778b020a09759f723f52","citation":"\\\r\n@inproceedings{luong-vu-2016-non,\r\n    title = \"A non-expert {K}aldi recipe for {V}ietnamese Speech Recognition System\",\r\n    author = \"Luong, Hieu-Thi  and\r\n      Vu, Hai-Quan\",\r\n    booktitle = \"Proceedings of the Third International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies ({WLSI}/{OIAF}4{HLT}2016)\",\r\n    month = dec,\r\n    year = \"2016\",\r\n    address = \"Osaka, Japan\",\r\n    publisher = \"The COLING 2016 Organizing Committee\",\r\n    url = \"https://aclanthology.org/W16-5207\",\r\n    pages = \"51--55\",\r\n}","description":"\\\r\nVIVOS is a free Vietnamese speech corpus consisting of 15 hours of recording speech prepared for\r\nVietnamese Automatic Speech Recognition task.\r\nThe corpus was prepared by AILAB, a computer science lab of VNUHCM - University of Science, with Prof. 
Vu Hai Quan is the head of.\r\nWe publish this corpus in hope to attract more scientists to solve Vietnamese speech recognition problems.","downloads":487,"tags":["task_categories:automatic-speech-recognition","annotations_creators:expert-generated","language_creators:crowdsourced","language_creators:expert-generated","multilinguality:monolingual","source_datasets:original","language:vi","license:cc-by-nc-sa-4.0","size_categories:10K<n<100K","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f18200b","id":"legacy-datasets/wikipedia","author":"legacy-datasets","disabled":false,"gated":false,"lastModified":"2024-03-11T18:16:32.000Z","likes":646,"trendingScore":1,"private":false,"sha":"97a0b052c326b45fb68593a14972d9eed884cd17","citation":"@ONLINE {wikidump,\n    author = {Wikimedia Foundation},\n    title  = {Wikimedia Downloads},\n    url    = {https://dumps.wikimedia.org}\n}","description":"Wikipedia dataset containing cleaned articles of all languages.\nThe datasets are built from the Wikipedia dump\n(https://dumps.wikimedia.org/) with one split per language. 
Each example\ncontains the content of one full Wikipedia article with cleaning to strip\nmarkdown and unwanted sections (references, etc.).","downloads":116542,"tags":["task_categories:text-generation","task_categories:fill-mask","task_ids:language-modeling","task_ids:masked-language-modeling","annotations_creators:no-annotation","language_creators:crowdsourced","multilinguality:multilingual","source_datasets:original","language:aa","language:ab","language:ace","language:af","language:ak","language:als","language:am","language:an","language:ang","language:ar","language:arc","language:arz","language:as","language:ast","language:atj","language:av","language:ay","language:az","language:azb","language:ba","language:bar","language:bcl","language:be","language:bg","language:bh","language:bi","language:bjn","language:bm","language:bn","language:bo","language:bpy","language:br","language:bs","language:bug","language:bxr","language:ca","language:cbk","language:cdo","language:ce","language:ceb","language:ch","language:cho","language:chr","language:chy","language:ckb","language:co","language:cr","language:crh","language:cs","language:csb","language:cu","language:cv","language:cy","language:da","language:de","language:din","language:diq","language:dsb","language:dty","language:dv","language:dz","language:ee","language:el","language:eml","language:en","language:eo","language:es","language:et","language:eu","language:ext","language:fa","language:ff","language:fi","language:fj","language:fo","language:fr","language:frp","language:frr","language:fur","language:fy","language:ga","language:gag","language:gan","language:gd","language:gl","language:glk","language:gn","language:gom","language:gor","language:got","language:gu","language:gv","language:ha","language:hak","language:haw","language:he","language:hi","language:hif","language:ho","language:hr","language:hsb","language:ht","language:hu","language:hy","language:ia","language:id","language:ie","language:ig","language:ii","language
:ik","language:ilo","language:inh","language:io","language:is","language:it","language:iu","language:ja","language:jam","language:jbo","language:jv","language:ka","language:kaa","language:kab","language:kbd","language:kbp","language:kg","language:ki","language:kj","language:kk","language:kl","language:km","language:kn","language:ko","language:koi","language:krc","language:ks","language:ksh","language:ku","language:kv","language:kw","language:ky","language:la","language:lad","language:lb","language:lbe","language:lez","language:lfn","language:lg","language:li","language:lij","language:lmo","language:ln","language:lo","language:lrc","language:lt","language:ltg","language:lv","language:lzh","language:mai","language:mdf","language:mg","language:mh","language:mhr","language:mi","language:min","language:mk","language:ml","language:mn","language:mr","language:mrj","language:ms","language:mt","language:mus","language:mwl","language:my","language:myv","language:mzn","language:na","language:nah","language:nan","language:nap","language:nds","language:ne","language:new","language:ng","language:nl","language:nn","language:no","language:nov","language:nrf","language:nso","language:nv","language:ny","language:oc","language:olo","language:om","language:or","language:os","language:pa","language:pag","language:pam","language:pap","language:pcd","language:pdc","language:pfl","language:pi","language:pih","language:pl","language:pms","language:pnb","language:pnt","language:ps","language:pt","language:qu","language:rm","language:rmy","language:rn","language:ro","language:ru","language:rue","language:rup","language:rw","language:sa","language:sah","language:sat","language:sc","language:scn","language:sco","language:sd","language:se","language:sg","language:sgs","language:sh","language:si","language:sk","language:sl","language:sm","language:sn","language:so","language:sq","language:sr","language:srn","language:ss","language:st","language:stq","language:su","language:sv","language:sw","lang
uage:szl","language:ta","language:tcy","language:tdt","language:te","language:tg","language:th","language:ti","language:tk","language:tl","language:tn","language:to","language:tpi","language:tr","language:ts","language:tt","language:tum","language:tw","language:ty","language:tyv","language:udm","language:ug","language:uk","language:ur","language:uz","language:ve","language:vec","language:vep","language:vi","language:vls","language:vo","language:vro","language:wa","language:war","language:wo","language:wuu","language:xal","language:xh","language:xmf","language:yi","language:yo","language:yue","language:za","language:zea","language:zh","language:zu","license:cc-by-sa-3.0","license:gfdl","size_categories:n<1K","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f182018","id":"wmt/wmt16","author":"wmt","disabled":false,"gated":false,"lastModified":"2024-04-03T12:30:24.000Z","likes":27,"trendingScore":1,"private":false,"sha":"41d8a4013aa1489f28fea60ec0932af246086482","description":"\n\t\n\t\t\n\t\tDataset Card for \"wmt16\"\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\n\n  Warning: There are issues with the Common Crawl corpus data (training-parallel-commoncrawl.tgz):\n  \n    Non-English files contain many English sentences.\n    Their \"parallel\" sentences in English are not aligned: they are uncorrelated with their counterpart.\n  \n  We have contacted the WMT organizers, and in response, they have indicated that they do not have plans to update the Common Crawl corpus data. 
Their rationale pertains… See the full description on the dataset page: https://huggingface.co/datasets/wmt/wmt16.","downloads":7239,"paperswithcode_id":"wmt-2016","tags":["task_categories:translation","annotations_creators:no-annotation","language_creators:found","multilinguality:translation","source_datasets:extended|europarl_bilingual","source_datasets:extended|news_commentary","source_datasets:extended|setimes","source_datasets:extended|un_multi","language:cs","language:de","language:en","language:fi","language:ro","language:ru","language:tr","license:unknown","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f182755","id":"SetFit/sst5","author":"SetFit","disabled":false,"gated":false,"lastModified":"2021-12-25T06:10:36.000Z","likes":21,"trendingScore":1,"private":false,"sha":"e51bdcd8cd3a30da231967c1a249ba59361279a3","description":"\n\t\n\t\t\n\t\tStanford Sentiment Treebank - Fine-Grained\n\t\n\nStanford Sentiment Treebank with 5 labels: very positive, positive, neutral, negative, very negative\nSplits are from: \nhttps://github.com/AcademiaSinicaNLPLab/sentiment_dataset/tree/master/data\nTraining data is on sentence level, not on phrase level!\n","downloads":14406,"tags":["size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f183019","id":"bs-modeling-metadata/website_metadata_c4","author":"bs-modeling-metadata","disabled":false,"gated":false,"lastModified":"2021-11-24T14:04:30.000Z","likes":4,"trendingScore":1,"private":false,"sha":"f6cba351f9d1893a3b60f76455cdbd64fc0239c7","description":"The dataset is in the form of a json lines file with 1,20,000 examples, where an example consists of text (extracted from C4 English 
dataset) and metadata fields (website description extracted from Wikipedia).\nExample:\n{\n    \"text\": \"US10289222B2 - Handling of touch events in a browser environment - Google Patents\\nHandling of touch events in a browser environment Download PDF\\nUS10289222B2\\nUS10289222B2 US13/857,848 US201313857848A US10289222B2 US 10289222 B2 US10289222 B2 US 10289222B2 US… See the full description on the dataset page: https://huggingface.co/datasets/bs-modeling-metadata/website_metadata_c4.","downloads":75,"tags":["size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f183057","id":"castorini/afriberta-corpus","author":"castorini","disabled":false,"gated":false,"lastModified":"2022-10-19T21:33:04.000Z","likes":17,"trendingScore":1,"private":false,"sha":"d83da9653ef2a5f823c3693a28018e3009464522","citation":"@inproceedings{ogueji-etal-2021-small,\n    title = \"Small Data? No Problem! 
Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages\",\n    author = \"Ogueji, Kelechi  and\n      Zhu, Yuxin  and\n      Lin, Jimmy\",\n    booktitle = \"Proceedings of the 1st Workshop on Multilingual Representation Learning\",\n    month = nov,\n    year = \"2021\",\n    address = \"Punta Cana, Dominican Republic\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2021.mrl-1.11\",\n    pages = \"116--126\",\n}","description":"Corpus used for training AfriBERTa models","downloads":191,"tags":["task_categories:text-generation","task_ids:language-modeling","language:om","language:am","language:rw","language:rn","language:ha","language:ig","language:pcm","language:so","language:sw","language:ti","language:yo","language:multilingual","license:apache-2.0","size_categories:1M<n<10M","modality:text","library:datasets","library:mlcroissant","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f183069","id":"ccdv/arxiv-classification","author":"ccdv","disabled":false,"gated":false,"lastModified":"2024-08-08T05:50:57.000Z","likes":25,"trendingScore":1,"private":false,"sha":"d4bf86db1cecd1f42e5ef24f1718d7c02c3250a7","description":"Arxiv Classification: a classification of Arxiv Papers (11 classes). \nThis dataset is intended for long context classification (documents have all > 4k tokens). 
Copied from \"Long Document Classification From Local Word Glimpses via Recurrent Attention Learning\"\n@ARTICLE{8675939,\n  author={He, Jun and Wang, Liqun and Liu, Liu and Feng, Jiao and Wu, Hao},\n  journal={IEEE Access}, \n  title={Long Document Classification From Local Word Glimpses via Recurrent Attention Learning}, \n  year={2019}… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/arxiv-classification.","downloads":4018,"tags":["task_categories:text-classification","task_ids:multi-class-classification","task_ids:topic-classification","language:en","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","long context"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f18306a","id":"ccdv/arxiv-summarization","author":"ccdv","disabled":false,"gated":false,"lastModified":"2024-08-08T05:49:50.000Z","likes":125,"trendingScore":1,"private":false,"sha":"240aaf1a969b3f8cd0ade6986bfad0cd730ee288","description":"\n\t\n\t\t\n\t\tArxiv dataset for summarization\n\t\n\nDataset for summarization of long documents.Adapted from this repo.Note that original data are pre-tokenized so this dataset returns \" \".join(text) and add \"\\n\" for paragraphs. 
This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable:\n\"ccdv/arxiv-summarization\": (\"article\", \"abstract\")\n\n\n\t\n\t\n\t\n\t\tData Fields\n\t\n\n\nid: paper id\narticle: a string containing the body of… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/arxiv-summarization.","downloads":8494,"tags":["task_categories:summarization","task_categories:text-generation","multilinguality:monolingual","language:en","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","conditional-text-generation"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f18306d","id":"ccdv/patent-classification","author":"ccdv","disabled":false,"gated":false,"lastModified":"2024-08-08T05:49:40.000Z","likes":30,"trendingScore":1,"private":false,"sha":"17bd94cd6594bac1b82a697e811fb7cf34a7af35","description":"Patent Classification: a classification of Patents and abstracts (9 classes). \nThis dataset is intended for long context classification (non abstract documents are longer that 512 tokens). 
Data are sampled from \"BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization.\" by Eva Sharma, Chen Li and Lu Wang \n\nSee: https://aclanthology.org/P19-1212.pdf \nSee: https://evasharma.github.io/bigpatent/\n\nIt contains 9 unbalanced classes, 35k Patents and abstracts divided into 3 splits:… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/patent-classification.","downloads":1585,"tags":["task_categories:text-classification","task_ids:multi-class-classification","task_ids:topic-classification","language:en","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","long context"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f1831f1","id":"deepset/germanquad","author":"deepset","disabled":false,"gated":false,"lastModified":"2023-04-06T13:58:35.000Z","likes":43,"trendingScore":1,"private":false,"sha":"fff05ceaf2ffbe5b65c7e0c57e678f7b7e1a0581","citation":"@misc{möller2021germanquad,\n      title={GermanQuAD and GermanDPR: Improving Non-English Question Answering and Passage Retrieval}, \n      author={Timo Möller and Julian Risch and Malte Pietsch},\n      year={2021},\n      eprint={2104.12741},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}","description":"In order to raise the bar for non-English QA, we are releasing a high-quality, human-labeled German QA dataset consisting of 13 722 questions, incl. a three-way annotated test set.\nThe creation of GermanQuAD is inspired by insights from existing datasets as well as our labeling experience from several industry projects. We combine the strengths of SQuAD, such as high out-of-domain performance, with self-sufficient questions that contain all relevant information for open-domain QA as in the NaturalQuestions dataset. 
Our training and test datasets do not overlap like other popular datasets and include complex questions that cannot be answered with a single entity or only a few words.","downloads":350,"tags":["task_categories:question-answering","task_categories:text-retrieval","task_ids:extractive-qa","task_ids:closed-domain-qa","task_ids:open-domain-qa","multilinguality:monolingual","source_datasets:original","language:de","license:cc-by-4.0","arxiv:2104.12741","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f1835c0","id":"huggingartists/xxxtentacion","author":"huggingartists","disabled":false,"gated":false,"lastModified":"2022-10-25T09:50:12.000Z","likes":1,"trendingScore":1,"private":false,"sha":"e838ad2ec7f92da7b91da83b7d11034ca392ec70","citation":"@InProceedings{huggingartists:dataset,\ntitle = {Lyrics dataset},\nauthor={Aleksey Korshuk\n},\nyear={2021}\n}","description":"This dataset is designed to generate lyrics with HuggingArtists.","downloads":30,"tags":["language:en","size_categories:n<1K","modality:text","library:datasets","library:mlcroissant","region:us","huggingartists","lyrics"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f1835f4","id":"iarfmoose/qa_evaluator","author":"iarfmoose","disabled":false,"gated":false,"lastModified":"2021-11-29T05:20:10.000Z","likes":4,"trendingScore":1,"private":false,"sha":"7e9a0fb84fd6c61d81fab5718bdb235f93625600","description":"This is the same dataset as the question_generator dataset but with the context removed and the question and answer in separate fields. 
This is intended to be used with the question_generator repo to train the qa_evaluator model which predicts whether a question and answer pair makes sense.\n","downloads":220,"tags":["size_categories:100K<n<1M","format:csv","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f1838b5","id":"lhoestq/demo1","author":"lhoestq","disabled":false,"gated":false,"lastModified":"2021-11-08T14:36:41.000Z","likes":4,"trendingScore":1,"private":false,"sha":"87ecf163bedca9d80598b528940a9c4f99e14c11","description":"\n\t\n\t\t\n\t\tDataset Card for Demo1\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThis is a demo dataset. It consists in two files data/train.csv and data/test.csv\nYou can load it with\nfrom datasets import load_dataset \ndemo1 = load_dataset(\"lhoestq/demo1\")  \n\n\n\t\n\t\t\n\t\tSupported Tasks and Leaderboards\n\t\n\n[More Information Needed]\n\n\t\n\t\t\n\t\tLanguages\n\t\n\n[More Information Needed]\n\n\t\n\t\t\n\t\tDataset Structure\n\t\n\n\n\t\n\t\t\n\t\tData Instances\n\t\n\n[More Information Needed]\n\n\t\n\t\t\n\t\tData Fields\n\t\n\n[More Information Needed]… See the full description on the dataset page: https://huggingface.co/datasets/lhoestq/demo1.","downloads":3591,"tags":["size_categories:n<1K","format:csv","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f1839f0","id":"maydogan/Turkish_SentimentAnalysis_TRSAv1","author":"maydogan","disabled":false,"gated":false,"lastModified":"2024-10-07T14:16:56.000Z","likes":15,"trendingScore":1,"private":false,"sha":"3bda52ed2c57ac1e80ac463192cafeeae9f72213","description":"TRSAv1 (Turkish Sentiment Analysis Version 1) Dataset\nThis data set has been produced to contribute to Turkish NLP studies. 
\nThe dataset consists of a total of 150 thousand samples, 50 thousand negative, 50 thousand positive, and 50 thousand neutral. \nIt can be used in text classification and sentiment analysis studies by citing the related study.\nRelated Work\nAydoğan M, Kocaman V. TRSAv1: A new benchmark dataset for classifying user reviews on Turkish e-commerce websites. Journal of… See the full description on the dataset page: https://huggingface.co/datasets/maydogan/Turkish_SentimentAnalysis_TRSAv1.","downloads":203,"tags":["task_categories:text-classification","language:tr","size_categories:100K<n<1M","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f183b05","id":"indonesian-nlp/mc4-id","author":"indonesian-nlp","disabled":false,"gated":false,"lastModified":"2022-10-25T11:52:34.000Z","likes":14,"trendingScore":1,"private":false,"sha":"38479a7a477f2388e20048c6161dc3b122575ea9","citation":"@article{JMLR:v21:20-074,\n  author  = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. 
Liu},\n  title   = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},\n  journal = {Journal of Machine Learning Research},\n  year    = {2020},\n  volume  = {21},\n  number  = {140},\n  pages   = {1-67},\n  url     = {http://jmlr.org/papers/v21/20-074.html}\n}","description":"A thoroughly cleaned version of the Italian portion of the multilingual \ncolossal, cleaned version of Common Crawl's web crawl corpus (mC4) by AllenAI.\nBased on Common Crawl dataset: \"https://commoncrawl.org\".\nThis is the processed version of Google's mC4 dataset by AllenAI, with further cleaning\ndetailed in the repository README file.","downloads":326,"paperswithcode_id":"mc4","tags":["task_categories:text-generation","task_ids:language-modeling","annotations_creators:no-annotation","language_creators:found","multilinguality:monolingual","source_datasets:extended","language:id","license:odc-by","size_categories:1M<n<10M","modality:text","library:datasets","library:mlcroissant","arxiv:1910.10683","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f183be7","id":"nielsr/funsd","author":"nielsr","disabled":false,"gated":false,"lastModified":"2025-06-23T14:34:08.000Z","likes":20,"trendingScore":1,"private":false,"sha":"7e7eeeedd84ce86540eb83cbbf7c75a3fcc7c7a5","downloads":2894,"tags":["size_categories:n<1K","format:parquet","modality:image","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f183d15","id":"pile-of-law/pile-of-law","author":"pile-of-law","disabled":false,"gated":false,"lastModified":"2023-01-08T03:10:35.000Z","likes":277,"trendingScore":1,"private":false,"sha":"0dc9f2c26b42af4cb6330f36d6146e82f9117a3b","citation":"@misc{hendersonkrass2022pileoflaw,\n  url = {https://arxiv.org/abs/2207.00220},\n  author = {Henderson, Peter and Krass, Mark S. 
and Zheng, Lucia and Guha, Neel and Manning, Christopher D. and Jurafsky, Dan and Ho, Daniel E.},\n  title = {Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset},\n  publisher = {arXiv},\n  year = {2022}\n}","description":"We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.","downloads":2473,"tags":["task_categories:fill-mask","task_ids:masked-language-modeling","annotations_creators:no-annotation","language_creators:found","multilinguality:monolingual","language:en","license:cc-by-nc-sa-4.0","size_categories:10M<n<100M","arxiv:2207.00220","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"621ffdd236468d709f18417a","id":"uitnlp/vietnamese_students_feedback","author":"uitnlp","disabled":false,"gated":false,"lastModified":"2022-10-13T15:39:37.000Z","likes":31,"trendingScore":1,"private":false,"sha":"7b56c6cb1c9c8523249f407044c838660df3811a","citation":"@InProceedings{8573337,\n  author={Nguyen, Kiet Van and Nguyen, Vu Duc and Nguyen, Phu X. V. and Truong, Tham T. H. 
and Nguyen, Ngan Luu-Thuy},\n  booktitle={2018 10th International Conference on Knowledge and Systems Engineering (KSE)},\n  title={UIT-VSFC: Vietnamese Students’ Feedback Corpus for Sentiment Analysis},\n  year={2018},\n  volume={},\n  number={},\n  pages={19-24},\n  doi={10.1109/KSE.2018.8573337}\n}","description":"Students’ feedback is a vital resource for the interdisciplinary research involving the combining of two different\nresearch fields between sentiment analysis and education.\n\nVietnamese Students’ Feedback Corpus (UIT-VSFC) is the resource consists of over 16,000 sentences which are\nhuman-annotated with two different tasks: sentiment-based and topic-based classifications.\n\nTo assess the quality of our corpus, we measure the annotator agreements and classification evaluation on the\nUIT-VSFC corpus. As a result, we obtained the inter-annotator agreement of sentiments and topics with more than over\n91% and 71% respectively. In addition, we built the baseline model with the Maximum Entropy classifier and achieved\napproximately 88% of the sentiment F1-score and over 84% of the topic F1-score.","downloads":1685,"tags":["task_categories:text-classification","task_ids:sentiment-classification","task_ids:topic-classification","annotations_creators:no-annotation","language_creators:found","multilinguality:monolingual","source_datasets:original","language:vi","license:unknown","size_categories:10K<n<100K","modality:text","library:datasets","library:mlcroissant","region:us"],"createdAt":"2022-03-02T23:29:22.000Z","key":""},{"_id":"622210c78898104f849914d1","id":"Alvenir/alvenir_asr_da_eval","author":"Alvenir","disabled":false,"gated":false,"lastModified":"2025-05-16T07:49:15.000Z","likes":7,"trendingScore":1,"private":false,"sha":"f049792fce529309c84d13e5143279d453924d16","description":"\n\t\n\t\t\n\t\tDataset Card alvenir_asr_da_eval\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThis dataset was created by Alvenir in order to evaluate ASR models in Danish. 
It can also be used for training but the amount is very limited.\nThe dataset consists of .wav files with corresponding reference text. The amount of data is just above 5 hours spread across 50 speakers with age in the interval 20-60 years old. The data was collected by a third party vendor through their software and people. All recordings have… See the full description on the dataset page: https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval.","downloads":63,"tags":["license:cc-by-4.0","size_categories:1K<n<10K","format:parquet","modality:audio","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-03-04T13:14:47.000Z","key":""},{"_id":"622d0eea1fa22851716c620c","id":"ctheodoris/Genecorpus-30M","author":"ctheodoris","disabled":false,"gated":false,"lastModified":"2024-03-25T23:01:49.000Z","likes":85,"trendingScore":1,"private":false,"sha":"71fae97a141652584e0cf09e8ace306e9b9e6d8c","description":"\n\t\n\t\t\n\t\tDataset Card for Genecorpus-30M\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\n\n\nPoint of Contact: christina.theodoris@gladstone.ucsf.edu\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nWe assembled a large-scale pretraining corpus, Genecorpus-30M, comprised of ~30 million human single cell transcriptomes from a broad range of tissues from publicly available data. 
This corpus was used for pretraining Geneformer, a pretrained transformer model that enables context-aware predictions in settings with limited data in… See the full description on the dataset page: https://huggingface.co/datasets/ctheodoris/Genecorpus-30M.","downloads":511,"tags":["license:apache-2.0","region:us"],"createdAt":"2022-03-12T21:21:46.000Z","key":""},{"_id":"6241b8793534f6d909d2edb4","id":"carolina-c4ai/corpus-carolina","author":"carolina-c4ai","disabled":false,"gated":false,"lastModified":"2025-06-11T11:45:38.000Z","likes":29,"trendingScore":1,"private":false,"sha":"55e63a519393c70a48dcfa14a558499c6bb0583b","description":"Carolina is an Open Corpus for Linguistics and Artificial Intelligence with a\nrobust volume of texts of varied typology in contemporary Brazilian Portuguese\n(1970-).","downloads":1101,"tags":["task_categories:fill-mask","task_categories:text-generation","task_ids:masked-language-modeling","task_ids:language-modeling","annotations_creators:no-annotation","language_creators:crowdsourced","multilinguality:monolingual","source_datasets:original","language:pt","license:cc-by-4.0","size_categories:1B<n<10B","arxiv:2303.16098","region:us"],"createdAt":"2022-03-28T13:30:33.000Z","key":""},{"_id":"6246e31fb215b9b324a8218f","id":"huggan/few-shot-obama","author":"huggan","disabled":false,"gated":false,"lastModified":"2022-04-12T14:05:43.000Z","likes":1,"trendingScore":1,"private":false,"sha":"5e44a58b8670e3c79a1a78efbd08ba3f3cbddfac","description":"\n\t\n\t\t\n\t\tCitation\n\t\n\n@article{DBLP:journals/corr/abs-2101-04775,\n  author    = {Bingchen Liu and\n               Yizhe Zhu and\n               Kunpeng Song and\n               Ahmed Elgammal},\n  title     = {Towards Faster and Stabilized {GAN} Training for High-fidelity Few-shot\n               Image Synthesis},\n  journal   = {CoRR},\n  volume    = {abs/2101.04775},\n  year      = {2021},\n  url       = {https://arxiv.org/abs/2101.04775},\n  eprinttype = {arXiv},\n  eprint    = 
{2101.04775},\n  timestamp… See the full description on the dataset page: https://huggingface.co/datasets/huggan/few-shot-obama.","downloads":72,"tags":["size_categories:n<1K","format:parquet","modality:image","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2101.04775","region:us"],"createdAt":"2022-04-01T11:33:51.000Z","key":""},{"_id":"624dc53eb4e6099df7613baf","id":"chainyo/rvl-cdip-invoice","author":"chainyo","disabled":false,"gated":false,"lastModified":"2022-04-06T16:57:20.000Z","likes":18,"trendingScore":1,"private":false,"sha":"fad615c9ceaecb4476b0a01f29c0a15b276b3a2b","description":"⚠️ This only a subpart of the original dataset, containing only invoice.\nThe RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. There are 320,000 training images, 40,000 validation images, and 40,000 test images. The images are sized so their largest dimension does not exceed 1000 pixels.\nFor questions and comments please contact Adam Harley (aharley@scs.ryerson.ca).\nThe full dataset… See the full description on the dataset page: https://huggingface.co/datasets/chainyo/rvl-cdip-invoice.","downloads":158,"tags":["license:other","size_categories:10K<n<100K","format:parquet","modality:image","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-04-06T16:52:14.000Z","key":""},{"_id":"625e8e36d28969004c120d8b","id":"google/fleurs","author":"google","disabled":false,"gated":false,"lastModified":"2026-05-15T09:35:34.000Z","likes":420,"trendingScore":1,"private":false,"sha":"70bb2e84b976b7e960aa89f1c648e09c59f894dd","description":"\n\t\n\t\t\n\t\tFLEURS\n\t\n\nFleurs is the speech version of the FLoRes machine translation benchmark. \nWe use 2009 n-way parallel sentences from the FLoRes dev and devtest publicly available sets, in 102 languages. 
\nTraining sets have around 10 hours of supervision. Speakers of the train sets are different than speakers from the dev/test sets. Multilingual fine-tuning is\nused and ”unit error rate” (characters, signs) of all languages is averaged. Languages and results are also grouped into seven… See the full description on the dataset page: https://huggingface.co/datasets/google/fleurs.","downloads":74665,"tags":["task_categories:automatic-speech-recognition","annotations_creators:expert-generated","annotations_creators:crowdsourced","annotations_creators:machine-generated","language_creators:crowdsourced","language_creators:expert-generated","multilinguality:multilingual","language:afr","language:amh","language:ara","language:asm","language:ast","language:azj","language:bel","language:ben","language:bos","language:cat","language:ceb","language:cmn","language:ces","language:cym","language:dan","language:deu","language:ell","language:eng","language:spa","language:est","language:fas","language:ful","language:fin","language:tgl","language:fra","language:gle","language:glg","language:guj","language:hau","language:heb","language:hin","language:hrv","language:hun","language:hye","language:ind","language:ibo","language:isl","language:ita","language:jpn","language:jav","language:kat","language:kam","language:kea","language:kaz","language:khm","language:kan","language:kor","language:ckb","language:kir","language:ltz","language:lug","language:lin","language:lao","language:lit","language:luo","language:lav","language:mri","language:mkd","language:mal","language:mon","language:mar","language:msa","language:mlt","language:mya","language:nob","language:npi","language:nld","language:nso","language:nya","language:oci","language:orm","language:ory","language:pan","language:pol","language:pus","language:por","language:ron","language:rus","language:bul","language:snd","language:slk","language:slv","language:sna","language:som","language:srp","language:swe","language:swh","language:tam","langu
age:tel","language:tgk","language:tha","language:tur","language:ukr","language:umb","language:urd","language:uzb","language:vie","language:wol","language:xho","language:yor","language:yue","language:zul","license:cc-by-4.0","size_categories:100K<n<1M","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2205.12446","arxiv:2106.03193","region:us","speech-recognition"],"createdAt":"2022-04-19T10:25:58.000Z","key":""},{"_id":"6261684d6dae705b25672d0d","id":"aharley/rvl_cdip","author":"aharley","disabled":false,"gated":false,"lastModified":"2024-09-10T13:49:12.000Z","likes":84,"trendingScore":1,"private":false,"sha":"03f14a4ad0a32413eff51ca10f9f511545f2bd5b","citation":"@inproceedings{harley2015icdar,\n    title = {Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval},\n    author = {Adam W Harley and Alex Ufkes and Konstantinos G Derpanis},\n    booktitle = {International Conference on Document Analysis and Recognition ({ICDAR})},\n    year = {2015}\n}","description":"The RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. 
There are 320,000 training images, 40,000 validation images, and 40,000 test images.","downloads":1003,"paperswithcode_id":"rvl-cdip","tags":["task_categories:image-classification","task_ids:multi-class-image-classification","annotations_creators:found","language_creators:found","multilinguality:monolingual","source_datasets:extended|iit_cdip","language:en","license:other","size_categories:100K<n<1M","arxiv:1502.07058","region:us"],"createdAt":"2022-04-21T14:21:01.000Z","key":""},{"_id":"626bb600974d6a67df6d035d","id":"nielsr/funsd-layoutlmv3","author":"nielsr","disabled":false,"gated":false,"lastModified":"2025-06-20T06:58:31.000Z","likes":42,"trendingScore":1,"private":false,"sha":"ccd2a77745b0dc9f154a91db1219ec05c86ce7ec","downloads":930,"tags":["size_categories:n<1K","format:parquet","modality:image","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-04-29T09:55:12.000Z","key":""},{"_id":"627cf66c4918f2e0b4313e92","id":"nickmuchi/rugd-dataset-all","author":"nickmuchi","disabled":false,"gated":false,"lastModified":"2022-05-12T13:37:05.000Z","likes":3,"trendingScore":1,"private":false,"sha":"689dccc2dd16d0c73bebc4dea422f1c314e4c666","downloads":128,"tags":["size_categories:1K<n<10K","format:parquet","modality:image","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-05-12T11:58:36.000Z","key":""},{"_id":"628162df58ca7bc54280a2c3","id":"mteb/amazon_massive_scenario","author":"mteb","disabled":false,"gated":false,"lastModified":"2025-05-04T16:08:05.000Z","likes":4,"trendingScore":1,"private":false,"sha":"58871793b91addb7c5f7afff26ccf08737fb6697","description":"\n  MassiveScenarioClassification\n  An MTEB dataset\n  Massive Text Embedding Benchmark\n\n\nMASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages\n\n\t\n\t\t\n\n\n\n\n\t\t\nTask 
category\nt2c\n\n\nDomains\nSpoken\n\nReference\nhttps://arxiv.org/abs/2204.08582\n\n\n\t\n\n\n\t\n\t\t\n\t\tHow to evaluate on this task\n\t\n\nYou can evaluate an embedding model on this dataset using the following code:\nimport mteb\n\ntask = mteb.get_tasks([\"MassiveScenarioClassification\"])\nevaluator =… See the full description on the dataset page: https://huggingface.co/datasets/mteb/amazon_massive_scenario.","downloads":4207,"tags":["task_categories:text-classification","annotations_creators:human-annotated","multilinguality:translated","language:afr","language:amh","language:ara","language:aze","language:ben","language:cmo","language:cym","language:dan","language:deu","language:ell","language:eng","language:fas","language:fin","language:fra","language:heb","language:hin","language:hun","language:hye","language:ind","language:isl","language:ita","language:jav","language:jpn","language:kan","language:kat","language:khm","language:kor","language:lav","language:mal","language:mon","language:msa","language:mya","language:nld","language:nob","language:pol","language:por","language:ron","language:rus","language:slv","language:spa","language:sqi","language:swa","language:swe","language:tam","language:tel","language:tgl","language:tha","language:tur","language:urd","language:vie","license:apache-2.0","size_categories:1M<n<10M","format:json","modality:text","library:datasets","library:dask","library:mlcroissant","arxiv:2204.08582","arxiv:2502.13595","arxiv:2210.07316","region:us","mteb","text"],"createdAt":"2022-05-15T20:30:23.000Z","key":""},{"_id":"62865c33504d37700309f40b","id":"mteb/mtop_intent","author":"mteb","disabled":false,"gated":false,"lastModified":"2025-07-19T20:10:18.000Z","likes":3,"trendingScore":1,"private":false,"sha":"9a200bf94b01afa0438a39c54f6a3aceaf78856f","description":"\n  MTOPIntentClassification\n  An MTEB dataset\n  Massive Text Embedding Benchmark\n\n\nMTOP: Multilingual Task-Oriented Semantic Parsing\n\n\t\n\t\t\n\n\n\n\n\t\t\nTask 
category\nt2c\n\n\nDomains\nSpoken, Spoken\n\n\nReference\nhttps://arxiv.org/pdf/2008.09335.pdf\n\n\n\t\n\n\n\t\n\t\t\n\t\tHow to evaluate on this task\n\t\n\nYou can evaluate an embedding model on this dataset using the following code:\nimport mteb\n\ntask = mteb.get_tasks([\"MTOPIntentClassification\"])\nevaluator = mteb.MTEB(task)\n\nmodel = mteb.get_model(YOUR_MODEL)… See the full description on the dataset page: https://huggingface.co/datasets/mteb/mtop_intent.","downloads":1027,"tags":["task_categories:text-classification","annotations_creators:human-annotated","multilinguality:multilingual","language:deu","language:eng","language:fra","language:hin","language:spa","language:tha","license:unknown","modality:text","arxiv:2008.09335","arxiv:2502.13595","arxiv:2210.07316","region:us","mteb","text"],"createdAt":"2022-05-19T15:03:15.000Z","key":""},{"_id":"628ea3c59202a9acd20d640a","id":"valurank/News_Articles_Categorization","author":"valurank","disabled":false,"gated":false,"lastModified":"2023-08-27T05:49:31.000Z","likes":5,"trendingScore":1,"private":false,"sha":"e1d8bda201b4f7e71daa3d64d757e0cbebb40e76","description":"\n\t\n\t\t\n\t\tDataset Card for News_Articles_Categorization\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\n3722 News Articles classified into different categories namely: World, Politics, Tech, Entertainment, Sport, Business, Health, and Science\n\n\t\n\t\t\n\t\tLanguages\n\t\n\nThe text in the dataset is in English\n\n\t\n\t\t\n\t\tDataset Structure\n\t\n\nThe dataset consists of two columns namely Text and Category.\nThe Text column consists of the news article and the Category column consists of the class each article belongs to… See the full description on the dataset page: 
https://huggingface.co/datasets/valurank/News_Articles_Categorization.","downloads":98,"tags":["task_categories:text-classification","task_ids:multi-class-classification","multilinguality:monolingual","language:en","license:other","size_categories:1K<n<10K","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-05-25T21:46:45.000Z","key":""},{"_id":"62a15454bff710e3fb1b10b1","id":"Bingsu/Human_Action_Recognition","author":"Bingsu","disabled":false,"gated":false,"lastModified":"2022-07-05T02:48:56.000Z","likes":31,"trendingScore":1,"private":false,"sha":"6c1a1284eb3557055d7c57b91cd7e68e3252b32c","description":"\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nA dataset from kaggle. origin: https://dphi.tech/challenges/data-sprint-76-human-activity-recognition/233/data\n\n\t\n\t\t\n\t\tIntroduction\n\t\n\n\nThe dataset features 15 different classes of Human Activities.\nThe dataset contains about 12k+ labelled images including the validation images.\nEach image has only one human activity category and are saved in separate folders of the labelled classes\n\n\n\t\n\t\t\n\t\tPROBLEM STATEMENT\n\t\n\n\nHuman Action Recognition (HAR) aims to understand… See the full description on the dataset page: 
https://huggingface.co/datasets/Bingsu/Human_Action_Recognition.","downloads":178,"tags":["task_categories:image-classification","source_datasets:original","language:en","license:odbl","size_categories:10K<n<100K","format:parquet","modality:image","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-06-09T02:00:52.000Z","key":""},{"_id":"62a2ca3e2f5c5547d54e4fcb","id":"AlekseyKorshuk/fantasy-books","author":"AlekseyKorshuk","disabled":false,"gated":"auto","lastModified":"2022-06-10T04:36:42.000Z","likes":11,"trendingScore":1,"private":false,"sha":"08958b8cc5c877b306a52ff260ec0f8a1b96421d","downloads":11,"tags":["size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-06-10T04:36:14.000Z","key":""},{"_id":"62a9dc9a471f7e0783124b0d","id":"codeparrot/apps","author":"codeparrot","disabled":false,"gated":false,"lastModified":"2022-10-20T15:00:15.000Z","likes":202,"trendingScore":1,"private":false,"sha":"21e74ddf8de1a21436da12e3e653065c5213e9d1","citation":"@article{hendrycksapps2021,\n  title={Measuring Coding Challenge Competence With APPS},\n  author={Dan Hendrycks and Steven Basart and Saurav Kadavath and Mantas Mazeika and Akul Arora and Ethan Guo and Collin Burns and Samir Puranik and Horace He and Dawn Song and Jacob Steinhardt},\n  journal={NeurIPS},\n  year={2021}\n}","description":"APPS is a benchmark for Python code generation, it includes 10,000 problems, which range from having simple oneline solutions to being substantial algorithmic challenges, for more details please refer to this paper: 
https://arxiv.org/pdf/2105.09938.pdf.","downloads":17424,"tags":["task_categories:text-generation","task_ids:language-modeling","language_creators:crowdsourced","language_creators:expert-generated","multilinguality:monolingual","language:code","license:mit","arxiv:2105.09938","arxiv:2203.07814","region:us"],"createdAt":"2022-06-15T13:20:26.000Z","key":""},{"_id":"62b1c7836a5435fd9a5d2529","id":"FacePerceiver/laion-face","author":"FacePerceiver","disabled":false,"gated":false,"lastModified":"2022-11-18T04:04:56.000Z","likes":31,"trendingScore":1,"private":false,"sha":"1bacaa7115661e44f6d60aeaab8c641333334463","description":"\n\t\n\t\t\n\t\tLaion-Face\n\t\n\nLAION-Face is the human face subset of LAION-400M, it consists of 50 million image-text pairs. Face detection is conducted to find images with faces. Apart from the 50 million full-set (LAION-Face 50M), there is a 20 million sub-set (LAION-Face 20M) for fast evaluation. \nLAION-Face is first used as the training set of FaRL, which provides powerful pre-training transformer backbones for face analysis tasks.\nFor more details, please check the official repo at… See the full description on the dataset page: https://huggingface.co/datasets/FacePerceiver/laion-face.","downloads":285,"tags":["region:us"],"createdAt":"2022-06-21T13:28:35.000Z","key":""},{"_id":"62b355ff8c8fac72c9700320","id":"nateraw/parti-prompts","author":"nateraw","disabled":false,"gated":false,"lastModified":"2022-06-22T19:17:49.000Z","likes":73,"trendingScore":1,"private":false,"sha":"944b156abfdad7627c3221b5ec4f6a6fb060a197","description":"\n\t\n\t\t\n\t\tDataset Card for PartiPrompts (P2)\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nPartiPrompts (P2) is a rich set of over 1600 prompts in English that we release\nas part of this work. P2 can be used to measure model capabilities across\nvarious categories and challenge aspects.\n\nP2 prompts can be simple, allowing us to gauge the progress from scaling. 
They\ncan also be complex, such as the following 67-word description we created for\nVincent van Gogh’s The Starry Night (1889):\nOil-on-canvas painting of a… See the full description on the dataset page: https://huggingface.co/datasets/nateraw/parti-prompts.","downloads":2153,"tags":["license:apache-2.0","size_categories:1K<n<10K","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-06-22T17:48:47.000Z","key":""},{"_id":"62bad5305e3622305c817671","id":"knkarthick/dialogsum","author":"knkarthick","disabled":false,"gated":false,"lastModified":"2023-10-03T10:56:21.000Z","likes":243,"trendingScore":1,"private":false,"sha":"a968e7aee0602e257935f1321a02e4287f7d5848","description":"\n\t\n\t\t\n\t\tDataset Card for DIALOGSum Corpus\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\n\n\t\n\t\t\n\t\tLinks\n\t\n\n\nHomepage: https://aclanthology.org/2021.findings-acl.449\nRepository: https://github.com/cylnlp/dialogsum\nPaper: https://aclanthology.org/2021.findings-acl.449\nPoint of Contact: https://huggingface.co/knkarthick\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Summary\n\t\n\nDialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 (Plus 100 holdout data for topic generation) dialogues with corresponding… See the full description on the dataset page: 
https://huggingface.co/datasets/knkarthick/dialogsum.","downloads":7740,"tags":["task_categories:summarization","task_categories:text-generation","annotations_creators:expert-generated","language_creators:expert-generated","multilinguality:monolingual","source_datasets:original","language:en","license:cc-by-nc-sa-4.0","size_categories:10K<n<100K","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","dialogue-summary","one-liner-summary","meeting-title","email-subject"],"createdAt":"2022-06-28T10:17:20.000Z","key":""},{"_id":"62bc0c42c41fcaa97b5163e2","id":"knkarthick/samsum","author":"knkarthick","disabled":false,"gated":false,"lastModified":"2025-09-18T10:58:14.000Z","likes":41,"trendingScore":1,"private":false,"sha":"6b929ff10edec703164e3ddb2e94aae058c9ab5f","description":"\n\t\n\t\t\n\t\tDataset Card for SAMSum Corpus\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\n\n\t\n\t\t\n\t\tLinks\n\t\n\n\nHomepage: https://arxiv.org/abs/1911.12237v2\nRepository: https://arxiv.org/abs/1911.12237v2\nPaper: https://arxiv.org/abs/1911.12237v2\nPoint of Contact: https://huggingface.co/knkarthick\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Summary\n\t\n\nThe SAMSum dataset contains about 16k messenger-like conversations with summaries. Conversations were created and written down by linguists fluent in English. 
Linguists were asked to… See the full description on the dataset page: https://huggingface.co/datasets/knkarthick/samsum.","downloads":10120,"paperswithcode_id":"samsum-corpus","tags":["task_categories:summarization","annotations_creators:expert-generated","language_creators:expert-generated","multilinguality:monolingual","source_datasets:original","language:en","license:cc-by-nc-nd-4.0","size_categories:10K<n<100K","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:1911.12237","region:us","conversations-summarization"],"createdAt":"2022-06-29T08:24:34.000Z","key":""},{"_id":"62be6afc1e22ec8427aac2c7","id":"zh-plus/tiny-imagenet","author":"zh-plus","disabled":false,"gated":false,"lastModified":"2022-07-12T09:04:30.000Z","likes":100,"trendingScore":1,"private":false,"sha":"5a77092c28e51558c5586e9c5eb71a7e17a5e43f","description":"\n\t\n\t\t\n\t\tDataset Card for tiny-imagenet\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nTiny ImageNet contains 100000 images of 200 classes (500 for each class) downsized to 64×64 colored images. 
Each class has 500 training images, 50 validation images, and 50 test images.\n\n\t\n\t\t\n\t\tLanguages\n\t\n\nThe class labels in the dataset are in English.\n\n\t\n\t\t\n\t\tDataset Structure\n\t\n\n\n\t\n\t\t\n\t\tData Instances\n\t\n\n{\n  'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=64x64 at 0x1A800E8E190>,\n  'label': 15\n}… See the full description on the dataset page: https://huggingface.co/datasets/zh-plus/tiny-imagenet.","downloads":21561,"paperswithcode_id":"imagenet","tags":["task_categories:image-classification","task_ids:multi-class-image-classification","annotations_creators:crowdsourced","language_creators:crowdsourced","multilinguality:monolingual","source_datasets:extended|imagenet-1k","language:en","size_categories:100K<n<1M","format:parquet","modality:image","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-07-01T03:33:16.000Z","key":""},{"_id":"62c287bdbe3a3ecd53733723","id":"holylovenia/TITML-IDN","author":"holylovenia","disabled":false,"gated":false,"lastModified":"2022-10-25T06:23:17.000Z","likes":7,"trendingScore":1,"private":false,"sha":"eb915043fa53039237e47183108b7aaf19b7da9e","description":"\n\t\n\t\t\n\t\tIndoLVCSR\n\t\n\nTITML-IDN (Tokyo Institute of Technology Multilingual - Indonesian) is collected and proposed by the authors of \"A Large Vocabulary Continuous Speech Recognition System for Indonesian Language\". The text transcriptions are obtained from newspaper and magazine articles. 
The speech is recorded from 20 speakers (11 males and 9 females).\n\n\t\n\t\t\n\t\tHow to cite\n\t\n\nIf you use this dataset, you have to cite this paper:\n@inproceedings{lestari2006titmlidn,\n  title={A large vocabulary… See the full description on the dataset page: https://huggingface.co/datasets/holylovenia/TITML-IDN.","downloads":53,"tags":["task_categories:automatic-speech-recognition","annotations_creators:expert-generated","language_creators:found","multilinguality:monolingual","source_datasets:original","language:id","license:other","size_categories:1K<n<10K","format:audiofolder","modality:audio","library:datasets","library:mlcroissant","region:us","speech-recognition"],"createdAt":"2022-07-04T06:25:01.000Z","key":""},{"_id":"62cf350a180d2ba1cd013ead","id":"facebook/flores","author":"facebook","disabled":false,"gated":"auto","lastModified":"2026-05-29T08:38:45.000Z","likes":106,"trendingScore":1,"private":false,"sha":"71abf77d8b7beb5cfef59898d6b24d92ab7654fc","description":"\n\t\n\t\t\n\t\n\t\n\t\tDataset Card for Flores 200\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Summary\n\t\n\n⚠️ This repository is no longer being updated ⚠️\nA newer version of the FLORES dataset managed by the Open Language Data Initiative\nis available at https://huggingface.co/datasets/openlanguagedata/flores_plus.\nFLORES is a benchmark dataset for machine translation between English and low-resource languages.\n\nThe creation of FLORES-200 doubles the existing language coverage of FLORES-101.\nGiven the nature of the new… See the full description on the dataset page: 
https://huggingface.co/datasets/facebook/flores.","downloads":6514,"paperswithcode_id":"flores","tags":["task_categories:text-generation","task_categories:translation","annotations_creators:found","language_creators:expert-generated","multilinguality:multilingual","multilinguality:translation","source_datasets:extended|flores","language:ace","language:acm","language:acq","language:aeb","language:af","language:ajp","language:ak","language:als","language:am","language:apc","language:ar","language:ars","language:ary","language:arz","language:as","language:ast","language:awa","language:ayr","language:azb","language:azj","language:ba","language:bm","language:ban","language:be","language:bem","language:bn","language:bho","language:bjn","language:bo","language:bs","language:bug","language:bg","language:ca","language:ceb","language:cs","language:cjk","language:ckb","language:crh","language:cy","language:da","language:de","language:dik","language:dyu","language:dz","language:el","language:en","language:eo","language:et","language:eu","language:ee","language:fo","language:fj","language:fi","language:fon","language:fr","language:fur","language:fuv","language:gaz","language:gd","language:ga","language:gl","language:gn","language:gu","language:ht","language:ha","language:he","language:hi","language:hne","language:hr","language:hu","language:hy","language:ig","language:ilo","language:id","language:is","language:it","language:jv","language:ja","language:kab","language:kac","language:kam","language:kn","language:ks","language:ka","language:kk","language:kbp","language:kea","language:khk","language:km","language:ki","language:rw","language:ky","language:kmb","language:kmr","language:knc","language:kg","language:ko","language:lo","language:lij","language:li","language:ln","language:lt","language:lmo","language:ltg","language:lb","language:lua","language:lg","language:luo","language:lus","language:lvs","language:mag","language:mai","language:ml","language:mar","language:min","language
:mk","language:mt","language:mni","language:mos","language:mi","language:my","language:nl","language:nn","language:nb","language:npi","language:nso","language:nus","language:ny","language:oc","language:ory","language:pag","language:pa","language:pap","language:pbt","language:pes","language:plt","language:pl","language:pt","language:prs","language:quy","language:ro","language:rn","language:ru","language:sg","language:sa","language:sat","language:scn","language:shn","language:si","language:sk","language:sl","language:sm","language:sn","language:sd","language:so","language:st","language:es","language:sc","language:sr","language:ss","language:su","language:sv","language:swh","language:szl","language:ta","language:taq","language:tt","language:te","language:tg","language:tl","language:th","language:ti","language:tpi","language:tn","language:ts","language:tk","language:tum","language:tr","language:tw","language:tzm","language:ug","language:uk","language:umb","language:ur","language:uzn","language:vec","language:vi","language:war","language:wo","language:xh","language:ydd","language:yo","language:yue","language:zh","language:zsm","language:zu","license:cc-by-sa-4.0","size_categories:1M<n<10M","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2207.04672","arxiv:1902.01382","region:us","conditional-text-generation"],"createdAt":"2022-07-13T21:11:38.000Z","key":""},{"_id":"62d73facacc7c69fbea6bb42","id":"naver-clova-ix/cord-v2","author":"naver-clova-ix","disabled":false,"gated":false,"lastModified":"2022-07-19T23:43:33.000Z","likes":124,"trendingScore":1,"private":false,"sha":"7f0115a4b758a71d6473b8d085751692da2fef98","downloads":10644,"tags":["license:cc-by-4.0","size_categories:1K<n<10K","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-07-19T23:35:08.000Z","key":""},{"_id":"62e318c247678ea
5ce20ede4","id":"ICML2022/ProteinGym","author":"ICML2022","disabled":false,"gated":false,"lastModified":"2022-07-29T00:19:31.000Z","likes":12,"trendingScore":1,"private":false,"sha":"e936ae69e3c70ff651d47889a389de6f596863b2","description":"\n\t\n\t\t\n\t\tProteinGym benchmarks overview\n\t\n\nProteinGym is an extensive set of Deep Mutational Scanning (DMS) assays curated to enable thorough comparisons of various mutation effect predictors in different regimes. It is comprised of two benchmarks: 1) a substitution benchmark which consists of the experimental characterisation of ∼1.5M missense variants across 87 DMS assays 2) an indel benchmark that includes ∼300k mutants across 7 DMS assays.\nEach processed file in each benchmark corresponds to… See the full description on the dataset page: https://huggingface.co/datasets/ICML2022/ProteinGym.","downloads":1295,"tags":["arxiv:2205.13760","region:us"],"createdAt":"2022-07-28T23:16:18.000Z","key":""},{"_id":"62f178d43e0991a8ab14c6af","id":"hoskinson-center/proof-pile","author":"hoskinson-center","disabled":false,"gated":false,"lastModified":"2023-08-19T03:24:11.000Z","likes":69,"trendingScore":1,"private":false,"sha":"490b980249446f2f3bd2df3a8cf085d0f2de240a","citation":"@InProceedings{huggingface:dataset,\ntitle = {proof-pile},\nauthor={Zhangir Azerbayev, Edward Ayers, Bartosz Piotrowski\n},\nyear={2022}\n}","description":"A dataset of high quality mathematical 
text.","downloads":1586,"tags":["task_categories:text-generation","task_ids:language-modeling","annotations_creators:no-annotation","language_creators:found","multilinguality:monolingual","language:en","license:apache-2.0","size_categories:100K<n<1M","modality:text","library:datasets","library:mlcroissant","region:us","math","mathematics","formal-mathematics"],"createdAt":"2022-08-08T20:57:56.000Z","key":""},{"_id":"62f857a7c5cf61a014328e88","id":"allenai/nllb","author":"allenai","disabled":false,"gated":false,"lastModified":"2022-09-29T18:53:15.000Z","likes":109,"trendingScore":1,"private":false,"sha":"c36967abb45f06ff7659849372ab41e01838193e","description":"\n\t\n\t\t\n\t\tDataset Card for No Language Left Behind (NLLB - 200vo)\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThis dataset was created based on metadata for mined bitext released by Meta AI.  It contains bitext for 148 English-centric and 1465 non-English-centric language pairs using the stopes mining library and the LASER3 encoders (Heffernan et al., 2022). 
The complete dataset is ~450GB.\nCCMatrix contains previous versions of mined instructions.\n\n\t\n\t\t\n\t\n\t\n\t\tHow to use the data\n\t\n\nThere are two ways to… See the full description on the dataset page: https://huggingface.co/datasets/allenai/nllb.","downloads":591,"tags":["arxiv:2207.0467","arxiv:2205.12654","arxiv:2207.04672","region:us"],"createdAt":"2022-08-14T02:02:15.000Z","key":""},{"_id":"62fd4ff64723285c5e151be0","id":"allenai/real-toxicity-prompts","author":"allenai","disabled":false,"gated":false,"lastModified":"2022-09-30T14:23:19.000Z","likes":122,"trendingScore":1,"private":false,"sha":"f21629712ffd6a3d13a54fd2807ccd521c55ef74","description":"\n\t\n\t\t\n\t\tDataset Card for Real Toxicity Prompts\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nRealToxicityPrompts is a dataset of 100k sentence snippets from the web for researchers to further address the risk of neural toxic degeneration in models.\n\n\t\n\t\t\n\t\tLanguages\n\t\n\nEnglish\n\n\t\n\t\t\n\t\tDataset Structure\n\t\n\n\n\t\n\t\t\n\t\tData Instances\n\t\n\nEach instance represents a prompt and its metadata:\n{\n  \"filename\":\"0766186-bc7f2a64cb271f5f56cf6f25570cd9ed.txt\",\n  \"begin\":340,\n  \"end\":564,\n  \"challenging\":false… See the full description on the dataset page: 
https://huggingface.co/datasets/allenai/real-toxicity-prompts.","downloads":10897,"tags":["multilinguality:monolingual","source_datasets:original","language:en","license:apache-2.0","size_categories:10K<n<100K","format:json","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2009.11462","doi:10.57967/hf/0002","region:us"],"createdAt":"2022-08-17T20:30:46.000Z","key":""},{"_id":"630a84ffe81e1dea2cef3404","id":"priyank-m/SROIE_2019_text_recognition","author":"priyank-m","disabled":false,"gated":false,"lastModified":"2022-08-27T21:38:24.000Z","likes":13,"trendingScore":1,"private":false,"sha":"04f6537e418eeb88863d617eb27817cc496522d7","description":"This dataset was prepared using the Scanned receipts OCR and information extraction (SROIE) dataset.\nThe SROIE dataset contains 973 scanned receipts in English language.\nCropping the bounding boxes from each of the receipts to generate this text-recognition dataset resulted in 33626 images for train set and 18704 images for the test set.\nThe text annotations for all the images inside a split are stored in a metadata.jsonl file.\nusage: \nfrom datasets import load_dataset\ndata =… See the full description on the dataset page: https://huggingface.co/datasets/priyank-m/SROIE_2019_text_recognition.","downloads":316,"tags":["task_categories:image-to-text","task_ids:image-captioning","multilinguality:monolingual","language:en","size_categories:10K<n<100K","format:imagefolder","modality:image","modality:text","library:datasets","library:mlcroissant","region:us","text-recognition","recognition"],"createdAt":"2022-08-27T20:56:31.000Z","key":""},{"_id":"63112274d43c55e811f9cd63","id":"zeroshot/twitter-financial-news-sentiment","author":"zeroshot","disabled":false,"gated":false,"lastModified":"2024-02-23T19:04:10.000Z","likes":174,"trendingScore":1,"private":false,"sha":"ccbe24de388e287beb92dd393a335c376b350ac3","description":"\n\t\n\t\t\n\t\tDataset 
Description\n\t\n\nThe Twitter Financial News dataset is an English-language dataset containing an annotated corpus of finance-related tweets. This dataset is used to classify finance-related tweets for their sentiment.\n\nThe dataset holds 11,932 documents annotated with 3 labels:\n\nsentiments = {\n    \"LABEL_0\": \"Bearish\", \n    \"LABEL_1\": \"Bullish\", \n    \"LABEL_2\": \"Neutral\"\n}  \n\nThe data was collected using the Twitter API. The current dataset supports the multi-class classification… See the full description on the dataset page: https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment.","downloads":4395,"tags":["task_categories:text-classification","task_ids:multi-class-classification","annotations_creators:other","language_creators:other","multilinguality:monolingual","source_datasets:original","language:en","license:mit","size_categories:10K<n<100K","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","twitter","finance","markets","stocks","wallstreet","quant","hedgefunds"],"createdAt":"2022-09-01T21:21:56.000Z","key":""},{"_id":"6325d153ff539edeea8f654a","id":"slone/myv_ru_2022","author":"slone","disabled":false,"gated":false,"lastModified":"2025-05-16T13:25:22.000Z","likes":3,"trendingScore":1,"private":false,"sha":"1afa70705c8d2886060dfb97be9bfe095aba6ecc","description":"\n\t\n\t\t\n\t\tDataset Card for slone/myv_ru_2022\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThis is a corpus of parallel Erzya-Russian words, phrases and sentences, collected in the paper The first neural machine translation system for the Erzya language.\nErzya (myv) is a language from the Uralic family. It is spoken primarily in the Republic of Mordovia and some other regions of Russia and other post-Soviet countries. We use the Cyrillic version of its script. 
\nThe corpus consists of the following parts:… See the full description on the dataset page: https://huggingface.co/datasets/slone/myv_ru_2022.","downloads":60,"tags":["task_categories:translation","annotations_creators:found","annotations_creators:machine-generated","language_creators:found","language_creators:machine-generated","multilinguality:translation","source_datasets:original","language:myv","language:ru","license:cc-by-sa-4.0","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2209.09368","region:us","erzya","mordovian"],"createdAt":"2022-09-17T13:53:23.000Z","key":""},{"_id":"6327a0abf0e99f96e029fda0","id":"yerevann/coco-karpathy","author":"yerevann","disabled":false,"gated":false,"lastModified":"2022-10-31T11:24:01.000Z","likes":22,"trendingScore":1,"private":false,"sha":"448fdb1bc7b2d09e46881c4541a14d796a3d41e8","description":"\n\t\n\t\t\n\t\tDataset Card for \"yerevann/coco-karpathy\"\n\t\n\nThe Karpathy split of COCO for image captioning.\n","downloads":3615,"tags":["task_categories:image-to-text","task_ids:image-captioning","language:en","size_categories:100K<n<1M","format:parquet","modality:image","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","coco","image-captioning"],"createdAt":"2022-09-18T22:50:19.000Z","key":""},{"_id":"632c39f416faa31b24d8ba15","id":"detection-datasets/fashionpedia","author":"detection-datasets","disabled":false,"gated":false,"lastModified":"2022-09-22T13:22:02.000Z","likes":84,"trendingScore":1,"private":false,"sha":"80845435ce686b8a9dbf70a05452fbfb8e09cdd7","description":"\n\t\n\t\t\n\t\tDataset Card for Fashionpedia\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nFashionpedia is a dataset mapping out the visual aspects of the fashion world.\nFrom the paper:\n\nFashionpedia is a new dataset which consists of two parts: (1) an ontology built by fashion experts 
containing 27 main apparel categories, 19 apparel parts, 294 fine-grained attributes and their relationships; (2) a dataset with everyday and celebrity event fashion images annotated with segmentation masks and their associated… See the full description on the dataset page: https://huggingface.co/datasets/detection-datasets/fashionpedia.","downloads":809,"paperswithcode_id":"fashionpedia","tags":["task_categories:object-detection","multilinguality:monolingual","source_datasets:original","language:en","license:cc-by-4.0","size_categories:10K<n<100K","format:parquet","modality:image","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2004.12276","region:us","object-detection","fashion","computer-vision"],"createdAt":"2022-09-22T10:33:24.000Z","key":""},{"_id":"633b28dac9b44f5c6ac48611","id":"RamAnanth1/lex-fridman-podcasts","author":"RamAnanth1","disabled":false,"gated":false,"lastModified":"2022-12-17T21:39:56.000Z","likes":6,"trendingScore":1,"private":false,"sha":"a5599d85efeeffeab2c512a02ced7c7a5bae05f2","description":"\n\t\n\t\t\n\t\tDataset Card for Lex Fridman Podcasts Dataset\n\t\n\nThis dataset is sourced from Andrej Karpathy's Lexicap website which contains English transcripts of Lex Fridman's wonderful podcast episodes. 
The transcripts were generated using OpenAI's large-sized Whisper model\n","downloads":12,"tags":["task_categories:text-classification","task_categories:text-generation","task_categories:summarization","task_ids:sentiment-analysis","task_ids:dialogue-modeling","task_ids:language-modeling","language_creators:found","multilinguality:monolingual","language:en","size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-10-03T18:24:26.000Z","key":""},{"_id":"633f151f33ba83e00bd74c15","id":"bigcode/the-stack-dedup","author":"bigcode","disabled":false,"gated":"auto","lastModified":"2023-08-17T08:21:58.000Z","likes":398,"trendingScore":1,"private":false,"sha":"17cad72c886a2858e08d4c349a00d6466f54df63","description":"\n\t\n\t\t\n\t\tDataset Card for The Stack\n\t\n\n\n\n\t\n\t\t\n\t\tChangelog\n\t\n\n\n\t\n\t\t\nRelease\nDescription\n\n\n\t\t\nv1.0\nInitial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: Three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 1.5TB in size.\n\n\nv1.1\nThe three copyleft licenses ((MPL/EPL/LGPL) were excluded and the list of permissive licenses extended to 193 licenses in total. 
The list of programming… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-dedup.","downloads":17329,"tags":["task_categories:text-generation","language_creators:crowdsourced","language_creators:expert-generated","multilinguality:multilingual","language:code","license:other","size_categories:100M<n<1B","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2211.15533","arxiv:2107.03374","arxiv:2207.14157","region:us"],"createdAt":"2022-10-06T17:49:19.000Z","key":""},{"_id":"6347c043dd46870a5aebec30","id":"nikitam/ACES","author":"nikitam","disabled":false,"gated":false,"lastModified":"2025-09-25T06:56:00.000Z","likes":16,"trendingScore":1,"private":false,"sha":"b497a6456957a5660ac20b8cac5b5222eb9b669c","description":"\n\t\n\t\t\n\t\tDataset Card for ACES and Span-ACES\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nACES consists of 36,476 examples covering 146 language pairs and representing challenges from 68 phenomena for evaluating machine translation metrics. We focus on translation accuracy errors and base the phenomena covered in our challenge set on the Multidimensional Quality Metrics (MQM) ontology. 
The phenomena range from simple perturbations at the word/character level to more complex errors based on discourse and… See the full description on the dataset page: https://huggingface.co/datasets/nikitam/ACES.","downloads":239,"tags":["task_categories:translation","multilinguality:multilingual","source_datasets:FLORES-101, FLORES-200, PAWS-X, XNLI, XTREME, WinoMT, Wino-X, MuCOW, EuroParl ConDisco, ParcorFull","language:multilingual","license:cc-by-nc-sa-4.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2401.16313","region:us"],"createdAt":"2022-10-13T07:37:39.000Z","key":""},{"_id":"634e3479038b5879133500ae","id":"julianmoraes/doodles-captions-manual","author":"julianmoraes","disabled":false,"gated":false,"lastModified":"2022-10-18T05:07:46.000Z","likes":8,"trendingScore":1,"private":false,"sha":"a575ac6bf49f47ddd90be60f69071769827d25de","downloads":47,"tags":["size_categories:1K<n<10K","format:parquet","modality:image","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-10-18T05:07:05.000Z","key":""},{"_id":"6361801a6483eb3832c8dc06","id":"ashraq/fashion-product-images-small","author":"ashraq","disabled":false,"gated":false,"lastModified":"2022-11-01T20:25:52.000Z","likes":42,"trendingScore":1,"private":false,"sha":"3859c76db2f6f3d3b9a3863345e3ccdbff75879d","description":"\n\t\n\t\t\n\t\tDataset Card for \"fashion-product-images-small\"\n\t\n\nMore Information needed\nData was obtained from 
here\n","downloads":1517,"tags":["size_categories:10K<n<100K","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2022-11-01T20:22:50.000Z","key":""},{"_id":"63670fb264bcbbd03e389b20","id":"VietAI/vi_pubmed","author":"VietAI","disabled":false,"gated":false,"lastModified":"2024-01-09T10:03:00.000Z","likes":24,"trendingScore":1,"private":false,"sha":"decfdcc57efa83466449ccaa658ad431a8a416d4","description":"\n\t\n\t\t\n\t\tDataset Summary\n\t\n\n20M Vietnamese PubMed biomedical abstracts translated by the state-of-the-art English-Vietnamese Translation project. The data has been used as unlabeled dataset for pretraining a Vietnamese Biomedical-domain Transformer model.\n\nimage source: Enriching Biomedical Knowledge for Vietnamese Low-resource Language Through Large-Scale Translation\n\n\t\n\t\n\t\n\t\tLanguage\n\t\n\n\nEnglish: Original biomedical abstracts from Pubmed\nVietnamese: Synthetic abstract translated by a… See the full description on the dataset page: https://huggingface.co/datasets/VietAI/vi_pubmed.","downloads":973,"paperswithcode_id":"pubmed","tags":["task_categories:text-generation","task_categories:fill-mask","task_ids:language-modeling","task_ids:masked-language-modeling","language:vi","language:en","license:cc","size_categories:10M<n<100M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2210.05610","arxiv:2210.05598","region:us"],"createdAt":"2022-11-06T01:36:50.000Z","key":""},{"_id":"636aff63c95145940b018bdd","id":"pacovaldez/stackoverflow-questions","author":"pacovaldez","disabled":false,"gated":false,"lastModified":"2022-11-10T00:14:37.000Z","likes":50,"trendingScore":1,"private":false,"sha":"869802e52b4dfa074d8a8e255ce85580711cdc25","description":"\n\t\n\t\t\n\t\tDataset Card for [Stackoverflow Post Questions]\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\nCompanies that sell 
Open-source software tools usually hire an army of Customer representatives to try to answer every question asked about their tool. The first step in this process \nis the prioritization of the question. The classification scale usually consists of 4 values, P0, P1, P2, and P3, with different meanings across every participant in the industry. On \nthe other hand, every software developer… See the full description on the dataset page: https://huggingface.co/datasets/pacovaldez/stackoverflow-questions.","downloads":987,"tags":["task_categories:text-classification","task_ids:multi-class-classification","annotations_creators:machine-generated","language_creators:found","multilinguality:monolingual","source_datasets:original","language:en","license:apache-2.0","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","stackoverflow","technical questions"],"createdAt":"2022-11-09T01:16:19.000Z","key":""},{"_id":"63716a5567cd0e88150491b9","id":"bigbio/bc5cdr","author":"bigbio","disabled":false,"gated":false,"lastModified":"2025-01-14T19:05:31.000Z","likes":14,"trendingScore":1,"private":false,"sha":"6ba1463320d003e8232dfa0c3ee9aaa2559998ec","citation":"@article{DBLP:journals/biodb/LiSJSWLDMWL16,\n  author    = {Jiao Li and\n               Yueping Sun and\n               Robin J. Johnson and\n               Daniela Sciaky and\n               Chih{-}Hsuan Wei and\n               Robert Leaman and\n               Allan Peter Davis and\n               Carolyn J. Mattingly and\n               Thomas C. Wiegers and\n               Zhiyong Lu},\n  title     = {BioCreative {V} {CDR} task corpus: a resource for chemical disease\n               relation extraction},\n  journal   = {Database J. Biol. 
Databases Curation},\n  volume    = {2016},\n  year      = {2016},\n  url       = {https://doi.org/10.1093/database/baw068},\n  doi       = {10.1093/database/baw068},\n  timestamp = {Thu, 13 Aug 2020 12:41:41 +0200},\n  biburl    = {https://dblp.org/rec/journals/biodb/LiSJSWLDMWL16.bib},\n  bibsource = {dblp computer science bibliography, https://dblp.org}\n}","description":"The BioCreative V Chemical Disease Relation (CDR) dataset is a large annotated text corpus of human annotations of all chemicals, diseases and their interactions in 1,500 PubMed articles.","downloads":2960,"tags":["multilinguality:monolingual","language:en","license:other","region:us"],"createdAt":"2022-11-13T22:06:13.000Z","key":""},{"_id":"63716ac891284164d1b8fb27","id":"bigbio/ddi_corpus","author":"bigbio","disabled":false,"gated":false,"lastModified":"2022-12-22T15:44:31.000Z","likes":11,"trendingScore":1,"private":false,"sha":"da8e94986a0c689095b22bed134248b11f9311c7","citation":"@article{HERREROZAZO2013914,\n  title        = {\n    The DDI corpus: An annotated corpus with pharmacological substances and\n    drug-drug interactions\n  },\n  author       = {\n    María Herrero-Zazo and Isabel Segura-Bedmar and Paloma Martínez and Thierry\n    Declerck\n  },\n  year         = 2013,\n  journal      = {Journal of Biomedical Informatics},\n  volume       = 46,\n  number       = 5,\n  pages        = {914--920},\n  doi          = {https://doi.org/10.1016/j.jbi.2013.07.011},\n  issn         = {1532-0464},\n  url          = {https://www.sciencedirect.com/science/article/pii/S1532046413001123},\n  keywords     = {Biomedical corpora, Drug interaction, Information extraction}\n}","description":"The DDI corpus has been manually annotated with drugs and pharmacokinetics and pharmacodynamics interactions. 
It contains 1025 documents from two different sources: DrugBank database and MedLine.","downloads":321,"tags":["multilinguality:monolingual","language:en","license:cc-by-nc-4.0","size_categories:1K<n<10K","modality:text","library:datasets","library:mlcroissant","region:us"],"createdAt":"2022-11-13T22:08:08.000Z","key":""},{"_id":"63a580bc062d8d60ff780ece","id":"misterkirill/ru-wikipedia","author":"misterkirill","disabled":false,"gated":false,"lastModified":"2022-12-23T10:20:31.000Z","likes":1,"trendingScore":1,"private":false,"sha":"5d66cb5ece2af653afa51845ce7569d27e42a612","downloads":138,"tags":["license:mit","size_categories:100K<n<1M","format:text","modality:text","library:datasets","library:mlcroissant","region:us"],"createdAt":"2022-12-23T10:19:40.000Z","key":""},{"_id":"63b0f0a761c6eafc31a9169f","id":"keremberke/license-plate-object-detection","author":"keremberke","disabled":false,"gated":false,"lastModified":"2023-01-18T20:37:51.000Z","likes":38,"trendingScore":1,"private":false,"sha":"a51194c739991abb50ac8afe14704aa99a66cf51","citation":"@misc{ vehicle-registration-plates-trudk_dataset,\n    title = { Vehicle Registration Plates Dataset },\n    type = { Open Source Dataset },\n    author = { Augmented Startups },\n    howpublished = { \\\\url{ https://universe.roboflow.com/augmented-startups/vehicle-registration-plates-trudk } },\n    url = { https://universe.roboflow.com/augmented-startups/vehicle-registration-plates-trudk },\n    journal = { Roboflow Universe },\n    publisher = { Roboflow },\n    year = { 2022 },\n    month = { jun },\n    note = { visited on 2023-01-18 },\n}","description":"\n  \n\n\n\n\t\n\t\t\n\t\tDataset Labels\n\t\n\n['license_plate']\n\n\n\t\n\t\t\n\t\tNumber of Images\n\t\n\n{'train': 6176, 'valid': 1765, 'test': 882}\n\n\n\t\n\t\t\n\t\n\t\n\t\tHow to Use\n\t\n\n\nInstall datasets:\n\npip install datasets\n\n\nLoad the dataset:\n\nfrom datasets import load_dataset\n\nds = 
load_dataset(\"keremberke/license-plate-object-detection\", name=\"full\")\nexample = ds['train'][0]\n\n\n\t\n\t\t\n\t\tRoboflow Dataset Page\n\t\n\nhttps://universe.roboflow.com/augmented-startups/vehicle-registration-plates-trudk/dataset/1\n\n\t\n\t\t\n\t\tCitation… See the full description on the dataset page: https://huggingface.co/datasets/keremberke/license-plate-object-detection.","downloads":922,"tags":["task_categories:object-detection","size_categories:1K<n<10K","modality:image","modality:text","library:datasets","library:mlcroissant","region:us","roboflow","roboflow2huggingface","Self Driving","Anpr"],"createdAt":"2023-01-01T02:32:07.000Z","key":""},{"_id":"63b35293d6c6529ede7d022c","id":"theatticusproject/cuad","author":"theatticusproject","disabled":false,"gated":false,"lastModified":"2023-01-02T22:36:46.000Z","likes":38,"trendingScore":1,"private":false,"sha":"a3c393f5d103fd0c516374e4fdff676c8176dcb1","downloads":5854,"tags":["license:cc-by-4.0","size_categories:10K<n<100K","format:text","modality:text","library:datasets","library:mlcroissant","region:us"],"createdAt":"2023-01-02T21:54:27.000Z","key":""},{"_id":"63d244b2dca71c5f885279d3","id":"Dipl0/Cours_QA_MK_2","author":"Dipl0","disabled":false,"gated":false,"lastModified":"2023-01-26T09:27:43.000Z","likes":1,"trendingScore":1,"private":false,"sha":"cb8faede254ce86c6d77c89e48a1d11c4da8022b","downloads":13,"tags":["size_categories:100K<n<1M","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-01-26T09:15:30.000Z","key":""},{"_id":"63d845f6143d89ad809480a9","id":"intronhealth/afrispeech-200","author":"intronhealth","disabled":false,"gated":false,"lastModified":"2023-11-20T09:20:34.000Z","likes":34,"trendingScore":1,"private":false,"sha":"b538c6e111914a812af28ff677f8cffc9b404b7d","citation":"TBD","description":"AFRISPEECH-200 is a 200hr Pan-African speech corpus for clinical and general domain English accented ASR; \na 
dataset with 120 African accents from 13 countries and 2,463 unique African speakers. \nOur goal is to raise awareness for and advance Pan-African English ASR research, \nespecially for the clinical domain.","downloads":886,"tags":["task_categories:automatic-speech-recognition","annotations_creators:expert-generated","language_creators:crowdsourced","language_creators:expert-generated","multilinguality:monolingual","source_datasets:original","language:en","license:cc-by-nc-sa-4.0","size_categories:10K<n<100K","arxiv:2310.00274","region:us"],"createdAt":"2023-01-30T22:34:30.000Z","key":""},{"_id":"63da45beaa68107243466309","id":"gsdf/EasyNegative","author":"gsdf","disabled":false,"gated":false,"lastModified":"2023-02-12T14:39:30.000Z","likes":1194,"trendingScore":1,"private":false,"sha":"60067b257337df8d7879142d870944fe4c6ab20d","description":"\n\t\n\t\t\n\t\tNegative Embedding\n\t\n\nThis is a Negative Embedding trained with Counterfeit. Please use it in the \"\\stable-diffusion-webui\\embeddings\" folder.It can be used with other models, but the effectiveness is not certain.  
\n\n\t\n\t\t\n\t\tCounterfeit-V2.0.safetensors\n\t\n\n\n\n\t\n\t\t\n\t\tAbyssOrangeMix2_sfw.safetensors\n\t\n\n\n\n\t\n\t\t\n\t\tanything-v4.0-pruned.safetensors\n\t\n\n\n","downloads":23743,"tags":["license:other","size_categories:n<1K","format:imagefolder","modality:image","library:datasets","library:mlcroissant","region:us"],"createdAt":"2023-02-01T10:58:06.000Z","key":""},{"_id":"63dbc14d8616f0acd3a758f5","id":"Dipl0/QA_SMART_FULL_V0.1","author":"Dipl0","disabled":false,"gated":false,"lastModified":"2023-02-02T14:00:03.000Z","likes":1,"trendingScore":1,"private":false,"sha":"1c88de69a7e0d1abf5da210dfb0f00f48e4c4634","downloads":13,"tags":["size_categories:100K<n<1M","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-02-02T13:57:33.000Z","key":""},{"_id":"63e4a9f6d69072437eb83642","id":"heegyu/news-category-dataset","author":"heegyu","disabled":false,"gated":false,"lastModified":"2023-02-09T08:10:48.000Z","likes":3,"trendingScore":1,"private":false,"sha":"304a05a55bc6abc0446d8fae0d0771716b6a271a","description":"Dataset from https://www.kaggle.com/datasets/rmisra/news-category-dataset\n","downloads":1066,"tags":["license:cc-by-4.0","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-02-09T08:08:22.000Z","key":""},{"_id":"63e6bf7a5c3664766ed48fdb","id":"breadlicker45/100k-websites","author":"breadlicker45","disabled":false,"gated":false,"lastModified":"2023-02-10T22:05:35.000Z","likes":2,"trendingScore":1,"private":false,"sha":"1f8e8fd492a1bca73d6999f7ec3fd8366363ba78","description":"from 
https://domains-index.com/\n","downloads":10,"tags":["region:us"],"createdAt":"2023-02-10T22:04:42.000Z","key":""},{"_id":"63ea428f5de6361c8dd3cbcf","id":"liwu/MNBVC","author":"liwu","disabled":false,"gated":false,"lastModified":"2026-03-26T15:33:09.000Z","likes":634,"trendingScore":1,"private":false,"sha":"cab9d03acb94815b7205d3fd3a9cfa37473853cf","citation":"\\","description":"MNBVC: Massive Never-ending BT Vast Chinese corpus","downloads":128304,"tags":["task_categories:text-generation","task_categories:fill-mask","task_ids:language-modeling","task_ids:masked-language-modeling","annotations_creators:other","language_creators:other","multilinguality:monolingual","source_datasets:original","language:zh","license:mit","region:us"],"createdAt":"2023-02-13T14:00:47.000Z","key":""},{"_id":"63eeddb47b7380ffc6a4d992","id":"dirtycomputer/yf_dianping","author":"dirtycomputer","disabled":false,"gated":false,"lastModified":"2023-02-21T02:44:18.000Z","likes":4,"trendingScore":1,"private":false,"sha":"50cab63d6567d7900287799aff75797d30eb3556","downloads":142,"tags":["size_categories:1M<n<10M","format:csv","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-02-17T01:51:48.000Z","key":""},{"_id":"63ef56ab02c246819b98aee1","id":"Loie/VGGSound","author":"Loie","disabled":false,"gated":false,"lastModified":"2023-03-26T13:25:40.000Z","likes":55,"trendingScore":1,"private":false,"sha":"2ca3012ef85a60143a8f97b83a45bb1a7b5c2244","description":"\n\t\n\t\t\n\t\tVGGSound\n\t\n\nVGG-Sound is an audio-visual correspondent dataset consisting of short clips of audio sounds, extracted from videos uploaded to YouTube.\n\nHomepage: https://www.robots.ox.ac.uk/~vgg/data/vggsound/\nPaper: https://arxiv.org/abs/2004.14368\nGithub: https://github.com/hche11/VGGSound\n\n\n\t\n\t\t\n\t\n\t\n\t\tAnalysis\n\t\n\n\n310+ classes: VGG-Sound contains audios spanning a large number of challenging acoustic environments 
and noise characteristics of real applications.\n200,000+ videos: All… See the full description on the dataset page: https://huggingface.co/datasets/Loie/VGGSound.","downloads":2525,"tags":["task_categories:audio-classification","size_categories:n<1K","format:imagefolder","modality:image","library:datasets","library:mlcroissant","arxiv:2004.14368","region:us"],"createdAt":"2023-02-17T10:27:55.000Z","key":""},{"_id":"63f18f9b5b11a9c3ca79324d","id":"theblackcat102/alexa-qa","author":"theblackcat102","disabled":false,"gated":false,"lastModified":"2023-02-19T04:14:43.000Z","likes":4,"trendingScore":1,"private":false,"sha":"89706412f40a2bc65cc66c1096ee5c1412a50a75","description":"\n\t\n\t\t\n\t\tAlexa Answers from alexaanswers.amazon.com\n\t\n\nThe Alexa Answers community helps to improve Alexa’s knowledge and answer questions asked by Alexa users. Which contains some very quirky and hard question like\nQ: what percent of the population has blackhair\nA: The most common hair color in the world is black and its found in wide array of background and ethnicities. 
About 75 to 85% of the global population has either black hair or the deepest brown shade.\nQ: what was the world population… See the full description on the dataset page: https://huggingface.co/datasets/theblackcat102/alexa-qa.","downloads":188,"tags":["task_categories:question-answering","language:en","license:mit","size_categories:100K<n<1M","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","alexa"],"createdAt":"2023-02-19T02:55:23.000Z","key":""},{"_id":"63f261fe3403f45dc6201fae","id":"vietgpt/news_summarization_vi","author":"vietgpt","disabled":false,"gated":false,"lastModified":"2023-07-04T05:30:39.000Z","likes":11,"trendingScore":1,"private":false,"sha":"a0370b23f9821af3d19f6c52ba5cc79dee64a0ac","description":"\n\t\n\t\t\n\t\tSummarization\n\t\n\n\nSource: https://github.com/binhvq/news-corpus\nLanguage: Vietnamese\nLabeling: text-davinci-003\nNum examples:\n65,361 (train)\n10,000 (test)\n\n\n\nfrom datasets import load_dataset\n\nload_dataset(\"tdtunlp/news_summarization_vi\")\n\n\nFormat for Summarization task\n\ndef preprocess(\n    sample,\n    sep_key=\"<|endofprompt|>\",\n    end_key=\"<|endoftext|>\",\n):\n    article = sample['content']\n    completion = sample['summary']\n    return {'text': \"\"\"{article}\nTL;DR: \n{sep_key}… See the full description on the dataset page: 
https://huggingface.co/datasets/vietgpt/news_summarization_vi.","downloads":213,"tags":["task_categories:summarization","language:vi","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","LM"],"createdAt":"2023-02-19T17:53:02.000Z","key":""},{"_id":"63fea2ec7a7f94a38ae7a365","id":"MultiCoNER/multiconer_v2","author":"MultiCoNER","disabled":false,"gated":false,"lastModified":"2023-07-06T18:37:15.000Z","likes":19,"trendingScore":1,"private":false,"sha":"4be2d62c912977ee26ed14d2553a4fe17ca3d980","citation":"@inproceedings{multiconer2-report,\n    title={{SemEval-2023 Task 2: Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2)}},\n    author={Fetahu, Besnik and Kar, Sudipta and Chen, Zhiyu and Rokhlenko, Oleg and Malmasi, Shervin},\n    booktitle={Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)},\n    year={2023},\n    publisher={Association for Computational Linguistics},\n}\n\n@article{multiconer2-data,\n    title={{MultiCoNER v2: a Large Multilingual dataset for Fine-grained and Noisy Named Entity Recognition}},\n    author={Fetahu, Besnik and Chen, Zhiyu and Kar, Sudipta and Rokhlenko, Oleg and Malmasi, Shervin},\n    year={2023},\n}","description":"Complex named entities (NE), like the titles of creative works, are not simple nouns and pose challenges for NER systems (Ashwini and Choi, 2014). They can take the form of any linguistic constituent, like an imperative clause (“Dial M for Murder”), and do not look like traditional NEs (Persons, Locations, etc.). This syntactic ambiguity makes it challenging to recognize them based on context. We organized the MultiCoNER task (Malmasi et al., 2022) at SemEval-2022 to address these challenges in 11 languages, receiving a very positive community response with 34 system papers. 
Results confirmed the challenges of processing complex and long-tail NEs: even the largest pre-trained Transformers did not achieve top performance without external knowledge. The top systems infused transformers with knowledge bases and gazetteers. However, such solutions are brittle against out of knowledge-base entities and noisy scenarios like the presence of spelling mistakes and typos. We propose MultiCoNER II which represents novel challenges through new tasks that emphasize the shortcomings of the current top models.\n\nMultiCoNER II features complex NER in these languages:\n\n1. English\n2. Spanish\n3. Hindi\n4. Bangla\n5. Chinese\n6. Swedish\n7. Farsi\n8. French\n9. Italian\n10. Portuguese\n11. Ukrainian\n12. German\n\nFor more details see https://multiconer.github.io/\n\n## References\n* Sandeep Ashwini and Jinho D. Choi. 2014. Targetable named entity recognition in social media. CoRR, abs/1408.0782.\n* Shervin Malmasi, Anjie Fang, Besnik Fetahu, Sudipta Kar, Oleg Rokhlenko. 2022. 
SemEval-2022 Task 11: Multilingual Complex Named Entity Recognition (MultiCoNER).","downloads":884,"tags":["task_categories:token-classification","language:bn","language:zh","language:de","language:en","language:es","language:fa","language:fr","language:hi","language:it","language:pt","language:sv","language:uk","license:cc-by-4.0","size_categories:1M<n<10M","modality:text","library:datasets","library:mlcroissant","region:us","multiconer","ner","multilingual","named entity recognition","fine-grained ner"],"createdAt":"2023-03-01T00:57:16.000Z","key":""},{"_id":"6406e186cf5e3e7bd5f3e198","id":"bmd1905/error-correction-vi","author":"bmd1905","disabled":false,"gated":false,"lastModified":"2023-03-07T07:30:51.000Z","likes":8,"trendingScore":1,"private":false,"sha":"58cbba445c2f73bd0987318d9f2c35d1415f310c","downloads":41,"tags":["language:vi","license:apache-2.0","size_categories:100K<n<1M","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-03-07T07:02:30.000Z","key":""},{"_id":"6411d2e26b4ee779dbcdf009","id":"HuggingFaceGECLM/REDDIT_comments","author":"HuggingFaceGECLM","disabled":false,"gated":false,"lastModified":"2023-03-17T07:52:51.000Z","likes":21,"trendingScore":1,"private":false,"sha":"54779d3d1f1c1b12e5989f695e13d38b394a558f","description":"\n\t\n\t\t\n\t\tDataset Card for \"REDDIT_comments\"\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nComments of 50 high-quality subreddits, extracted from the REDDIT PushShift data dumps (from 2006 to Jan 2023).\n\n\t\n\t\t\n\t\tSupported Tasks\n\t\n\nThese comments can be used for text generation and language modeling, as well as dialogue modeling.\n\n\t\n\t\t\n\t\tDataset Structure\n\t\n\n\n\t\n\t\t\n\t\tData Splits\n\t\n\nEach split corresponds to a specific subreddit in the following list: \"tifu\", \"explainlikeimfive\", \"WritingPrompts\", \"changemyview\"… See the full description on the dataset page: 
https://huggingface.co/datasets/HuggingFaceGECLM/REDDIT_comments.","downloads":3205,"tags":["task_categories:text-generation","task_ids:dialogue-modeling","task_ids:language-modeling","annotations_creators:no-annotation","language_creators:found","multilinguality:monolingual","language:en","size_categories:100M<n<1B","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2001.08435","region:us","reddit","social-media"],"createdAt":"2023-03-15T14:14:58.000Z","key":""},{"_id":"641678db756ca5c234037543","id":"RyokoAI/Fandom23K","author":"RyokoAI","disabled":false,"gated":false,"lastModified":"2023-03-20T19:58:46.000Z","likes":32,"trendingScore":1,"private":false,"sha":"a754b3eaad00502435532e9b54c792029382796c","description":"\n\t\n\t\t\n\t\tDataset Card for Fandom23K\n\t\n\nThe BigKnow2022 dataset and its subsets are not yet complete. Not all information here may be accurate or accessible.\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nFandom23K is a dataset composed of 15,616,749 articles scraped from approximately 23,665 Fandom.com wikis between March 14 and March 18, 2023.\nIt is a subset of the upcoming BigKnow2022 dataset.\n\n\t\n\t\t\n\t\tSupported Tasks and Leaderboards\n\t\n\nThis dataset is primarily intended for unsupervised training of text… See the full description on the dataset page: 
https://huggingface.co/datasets/RyokoAI/Fandom23K.","downloads":481,"tags":["task_categories:text-classification","task_categories:text-generation","language:en","license:cc-by-sa-3.0","size_categories:1M<n<10M","format:json","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","wiki","training"],"createdAt":"2023-03-19T02:52:11.000Z","key":""},{"_id":"641eaf99ce000d321824a8fe","id":"acheong08/nsfw_reddit","author":"acheong08","disabled":false,"gated":false,"lastModified":"2023-04-09T13:44:10.000Z","likes":41,"trendingScore":1,"private":false,"sha":"a6fe04cf83c91fcf2258735d4fb34595443424bf","downloads":61,"tags":["license:openrail","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-03-25T08:23:53.000Z","key":""},{"_id":"641edcf5485c37e54d68fb57","id":"shibing624/alpaca-zh","author":"shibing624","disabled":false,"gated":false,"lastModified":"2023-05-10T06:09:06.000Z","likes":144,"trendingScore":1,"private":false,"sha":"f39db019a94f8dbea48ab30d2bdc090703284559","description":"\n\t\n\t\t\n\t\tDataset Card for \"alpaca-zh\"\n\t\n\n本数据集是参考Alpaca方法基于GPT4得到的self-instruct数据，约5万条。\nDataset from https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM \nIt is the chinese dataset from https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/blob/main/data/alpaca_gpt4_data_zh.json\n\n\t\n\t\t\n\t\n\t\n\t\tUsage and License Notices\n\t\n\nThe data is intended and licensed for research use only. 
The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not… See the full description on the dataset page: https://huggingface.co/datasets/shibing624/alpaca-zh.","downloads":2004,"tags":["task_categories:text-generation","language:zh","license:cc-by-4.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2304.03277","region:us","gpt","alpaca","fine-tune","instruct-tune","instruction"],"createdAt":"2023-03-25T11:37:25.000Z","key":""},{"_id":"6420c2ab0f061ae62185d3b8","id":"sahil2801/CodeAlpaca-20k","author":"sahil2801","disabled":false,"gated":false,"lastModified":"2023-10-03T11:46:04.000Z","likes":236,"trendingScore":1,"private":false,"sha":"152bb5e9a29651266b018106053980070a0521a1","downloads":21890,"tags":["task_categories:text-generation","language:en","license:cc-by-4.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","code"],"createdAt":"2023-03-26T22:09:47.000Z","key":""},{"_id":"6422ad667f15837751c8d53d","id":"TeamSODA/LibriTTS","author":"TeamSODA","disabled":false,"gated":false,"lastModified":"2023-03-28T12:31:28.000Z","likes":1,"trendingScore":1,"private":false,"sha":"4981699b1e45744922c54aaffba75adffb47ad26","description":"\n\t\n\t\t\n\t\tUsage\n\t\n\nfrom datasets import load_dataset\ndataset = load_dataset('TeamSODA/LibriTTS', 
streaming=True)\n\n","downloads":88,"tags":["size_categories:10K<n<100K","format:parquet","modality:audio","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-03-28T09:03:34.000Z","key":""},{"_id":"6422e7a0bbbbabc97c8a4d97","id":"tsdocode/vi_alpaca_clean","author":"tsdocode","disabled":false,"gated":false,"lastModified":"2023-03-28T13:14:52.000Z","likes":6,"trendingScore":1,"private":false,"sha":"d41246ca09ebc5bd7aad33ed50ffa62d27460b42","downloads":53,"tags":["task_categories:text-generation","language:vi","license:cc-by-4.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","instruction-finetuning"],"createdAt":"2023-03-28T13:12:00.000Z","key":""},{"_id":"64235dc97c55aeb6f72ed098","id":"laion/relaion400m","author":"laion","disabled":false,"gated":"auto","lastModified":"2024-07-13T21:19:34.000Z","likes":75,"trendingScore":1,"private":false,"sha":"2bf7009eb1967520f9cc5127d858cd3e39427b3e","downloads":865,"tags":["size_categories:100M<n<1B","format:parquet","modality:image","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-03-28T21:36:09.000Z","key":""},{"_id":"642451ec10c8e82b0947433d","id":"vietgpt/daily_dialog_vi","author":"vietgpt","disabled":false,"gated":false,"lastModified":"2023-06-21T14:11:16.000Z","likes":4,"trendingScore":1,"private":false,"sha":"7f0c6eef81786872a00938f1fb942aee930a1508","description":"\n\t\n\t\t\n\t\tDailyDialog\n\t\n\n\nSource: https://huggingface.co/datasets/daily_dialog\nNum examples: \n11,118 (train)\n1,000 (validation)\n1,000 (test)\n\n\nLanguage: Vietnamese\n\nfrom datasets import 
load_dataset\n\nload_dataset(\"vietgpt/daily_dialog_vi\")\n\n","downloads":44,"tags":["language:vi","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","SFT"],"createdAt":"2023-03-29T14:57:48.000Z","key":""},{"_id":"642551c62b05650026adfdd2","id":"Francesco/radio-signal","author":"Francesco","disabled":false,"gated":false,"lastModified":"2023-03-30T09:10:00.000Z","likes":3,"trendingScore":1,"private":false,"sha":"4b88e2c64beca7f6cb67ff8e5d96aa22b9bfe33c","description":"\n\t\n\t\t\n\t\tDataset Card for radio-signal\n\t\n\n** The original COCO dataset is stored at dataset.tar.gz**\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nradio-signal\n\n\t\n\t\t\n\t\tSupported Tasks and Leaderboards\n\t\n\n\nobject-detection: The dataset can be used to train a model for Object Detection.\n\n\n\t\n\t\t\n\t\tLanguages\n\t\n\nEnglish\n\n\t\n\t\t\n\t\tDataset Structure\n\t\n\n\n\t\n\t\t\n\t\tData Instances\n\t\n\nA data point comprises an image and its object annotations.\n{\n  'image_id': 15,\n  'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB… See the full description on the dataset page: https://huggingface.co/datasets/Francesco/radio-signal.","downloads":154,"tags":["task_categories:object-detection","annotations_creators:crowdsourced","language_creators:found","multilinguality:monolingual","source_datasets:original","language:en","license:cc","size_categories:1K<n<10K","format:parquet","modality:image","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","rf100"],"createdAt":"2023-03-30T09:09:26.000Z","key":""},{"_id":"6427bf06188b9f01b17e636b","id":"LinaAlhuri/ArabicConceptualCaptions3M","author":"LinaAlhuri","disabled":false,"gated":false,"lastModified":"2023-11-15T09:24:55.000Z","likes":3,"trendingScore":1,"private":false,"sha":"9aebbaf2949f6d71ea97f6b147dab0e1235493fd","description":"\n\t\n\t\t\n\t\tArabic Translated Conceptual Captions 
Dataset\n\t\n\n\n\t\n\t\t\n\t\tOverview\n\t\n\nThis dataset consists of conceptual captions translated into Arabic using the Google Translate API. It serves as a resource for researchers and developers interested in exploring the vision-language tasks and biases introduced during the translation process.\n\n\t\n\t\t\n\t\tDataset Information\n\t\n\n\nSource Dataset: Conceptual Captions\nTranslation Tool: Google Translate API\nTranslation Language: English to Arabic… See the full description on the dataset page: https://huggingface.co/datasets/LinaAlhuri/ArabicConceptualCaptions3M.","downloads":68,"tags":["task_categories:image-to-text","language:ar","size_categories:1M<n<10M","format:csv","modality:image","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-04-01T05:20:06.000Z","key":""},{"_id":"642912f7a760fe0bf37996b1","id":"anon8231489123/ShareGPT_Vicuna_unfiltered","author":"anon8231489123","disabled":false,"gated":false,"lastModified":"2023-04-12T05:23:59.000Z","likes":883,"trendingScore":1,"private":false,"sha":"192ab2185289094fc556ec8ce5ce1e8e587154ca","description":"Further cleaning done. 
Please look through the dataset and ensure that I didn't miss anything.\nUpdate: Confirmed working method for training the model: https://huggingface.co/AlekseyKorshuk/vicuna-7b/discussions/4#64346c08ef6d5abefe42c12c\nTwo choices:\n\nRemoves instances of \"I'm sorry, but\": https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json\nHas instances of \"I'm sorry, but\":… See the full description on the dataset page: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered.","downloads":195113,"tags":["language:en","license:apache-2.0","region:us"],"createdAt":"2023-04-02T05:30:31.000Z","key":""},{"_id":"642afd6dfe4a12faa4293649","id":"pythainlp/thailaw","author":"pythainlp","disabled":false,"gated":false,"lastModified":"2023-05-21T14:34:49.000Z","likes":11,"trendingScore":1,"private":false,"sha":"c62670ca8a45de3cd241465867a8dd592f4b5dc2","description":"\n\t\n\t\t\n\t\tDataset Card for \"thailaw\"\n\t\n\n\n\t\n\t\t\n\t\tEnglish\n\t\n\nThai Law Dataset (Act of Parliament)\n\nData source from Office of the Council of State, Thailand. 
https://www.krisdika.go.th/\nThis part of PyThaiNLP Project.\nLicense Dataset is public domain.\n\nDownload https://github.com/PyThaiNLP/thai-law/releases\nThis hub based on Thailaw v0.2.\n\n\t\n\t\t\n\t\n\t\n\t\tThai\n\t\n\nคลังข้อมูลกฎหมายไทย (พระราชบัญญัติ)\n\nข้อมูลเก็บรวบรวมมาจากเว็บไซต์สำนักงานคณะกรรมการกฤษฎีกา https://www.krisdika.go.th/… See the full description on the dataset page: https://huggingface.co/datasets/pythainlp/thailaw.","downloads":64,"tags":["task_categories:text-generation","language:th","license:cc0-1.0","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","legal"],"createdAt":"2023-04-03T16:23:09.000Z","key":""},{"_id":"642c21d86c11a741c521acc3","id":"maykcaldas/smiles-transformers","author":"maykcaldas","disabled":false,"gated":false,"lastModified":"2023-04-04T22:02:47.000Z","likes":22,"trendingScore":1,"private":false,"sha":"abb3705e6c68c767854f35a1d9fd90d84f088393","description":"\n\t\n\t\t\n\t\tsmiles-transformers dataset\n\t\n\n  TODO: Add references to the datasets we curated\n\n\t\n\t\t\n\t\tdataset features\n\t\n\n\nname: text\nMolecule SMILES : string\n\n\nname: formula\nMolecular formula : string\n\n\nname: NumHDonors\nNumber of hidrogen bond donors : int\n\n\nname: NumHAcceptors\nNumber of hidrogen bond acceptors : int\n\n\nname: MolLogP\nWildman-Crippen LogP : float\n\n\nname: NumHeteroatoms\nNumber of hetero atoms: int\n\n\nname: RingCount\nNumber of rings : int\n\n\nname: NumRotatableBonds\nNumber of rotable… See the full description on the dataset page: 
https://huggingface.co/datasets/maykcaldas/smiles-transformers.","downloads":1986,"tags":["language:en","license:mit","size_categories:1B<n<10B","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-04-04T13:10:48.000Z","key":""},{"_id":"642c2480d0c59e066d01a44a","id":"fddemarco/pushshift-reddit-comments","author":"fddemarco","disabled":false,"gated":false,"lastModified":"2023-05-14T17:19:16.000Z","likes":27,"trendingScore":1,"private":false,"sha":"987111c7c262a3c1255f5da4ecc545782280e794","description":"\n\t\n\t\t\n\t\tDataset Card for \"pushshift-reddit\"\n\t\n\nMore Information needed\n","downloads":3114,"tags":["size_categories:1B<n<10B","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-04-04T13:22:08.000Z","key":""},{"_id":"642d8724baf943d5db411db8","id":"KK04/LogicInference_OA","author":"KK04","disabled":false,"gated":false,"lastModified":"2023-04-05T15:38:22.000Z","likes":21,"trendingScore":1,"private":false,"sha":"a4d55d92425f062c4e51c37e08045df5fae4fc2f","description":"\n\t\n\t\t\n\t\tDataset Card for \"LogicInference_OA\"\n\t\n\nThis is an re-produce of the dataset from LogicInference Dataset in paper: https://openreview.net/pdf?id=HAGeIS_Lcg9.\nThe github page of LogicInference Dataset: https://github.com/google-research/google-research/tree/master/logic_inference_dataset.\nThis dataset is aimed to offer more dataset for Open Assistant project, depending on their demands, there three columns: INSTRUCTION, RESPONSE, SOURCE.\nThe results in this dataset is a little different… See the full description on the dataset page: 
https://huggingface.co/datasets/KK04/LogicInference_OA.","downloads":66,"tags":["task_categories:question-answering","language:en","license:apache-2.0","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","Logic Inference"],"createdAt":"2023-04-05T14:35:16.000Z","key":""},{"_id":"642f19fd4ed6df849978b12a","id":"huanngzh/anime_face_control_60k","author":"huanngzh","disabled":false,"gated":false,"lastModified":"2023-04-07T02:20:48.000Z","likes":6,"trendingScore":1,"private":false,"sha":"04e4f951a987a668cd90e43c5db1bcbf74b986c2","description":"\n\t\n\t\t\n\t\tDataset Card for \"acgn_face_control_60k\"\n\t\n\nMore Information needed\n","downloads":26,"tags":["size_categories:10K<n<100K","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-04-06T19:14:05.000Z","key":""},{"_id":"6430436366bbb3a3fc51f726","id":"vicgalle/alpaca-gpt4","author":"vicgalle","disabled":false,"gated":false,"lastModified":"2024-02-10T10:03:45.000Z","likes":325,"trendingScore":1,"private":false,"sha":"f7e3ded725cb81e8e564e32feb12860f376f2b51","description":"\n\t\n\t\t\n\t\tDataset Card for \"alpaca-gpt4\"\n\t\n\nThis dataset contains English Instruction-Following generated by GPT-4 using Alpaca prompts for fine-tuning LLMs.\nThe dataset was originaly shared in this repository: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM. 
This is just a wraper for compatibility with huggingface's datasets library.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset structure\n\t\n\nIt contains 52K instruction-following data generated by GPT-4 using the same prompts as in Alpaca.\nThe dataset has the… See the full description on the dataset page: https://huggingface.co/datasets/vicgalle/alpaca-gpt4.","downloads":5061,"tags":["task_categories:text-generation","task_categories:question-answering","language:en","license:cc-by-nc-4.0","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2304.03277","region:us","gpt4","alpaca","instruction-finetuning","synthetic"],"createdAt":"2023-04-07T16:22:59.000Z","key":""},{"_id":"643078357195f4075a3115a3","id":"QuixiAI/leet10k-alpaca","author":"QuixiAI","disabled":false,"gated":false,"lastModified":"2023-05-02T05:44:45.000Z","likes":20,"trendingScore":1,"private":false,"sha":"1c437ebf0393e6f57a499c39f587ba328d919c7a","downloads":84,"tags":["license:apache-2.0","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-04-07T20:08:21.000Z","key":""},{"_id":"6430888aa04a488be637250f","id":"azhx/counterfact","author":"azhx","disabled":false,"gated":false,"lastModified":"2023-04-07T21:22:57.000Z","likes":7,"trendingScore":1,"private":false,"sha":"c01c413f856ee38f5c080c9fc5e87aff478e2ff9","description":"\n\t\n\t\t\n\t\tDataset Card for \"counterfact\"\n\t\n\nDataset from ROME by Meng et al.\nMore Information 
needed\n","downloads":1921,"tags":["size_categories:10K<n<100K","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-04-07T21:18:02.000Z","key":""},{"_id":"6431d4676758730899da6251","id":"IlyaGusev/ru_turbo_saiga","author":"IlyaGusev","disabled":false,"gated":false,"lastModified":"2023-09-04T13:26:47.000Z","likes":29,"trendingScore":1,"private":false,"sha":"4f92f393386e9e623f39832411aeee3064b0890c","description":"\n\t\n\t\t\n\t\tSaiga\n\t\n\nDataset of ChatGPT-generated chats in Russian.\n\n\nBased on the Baize paper.\nCode: link.\nPrompt:\nИдёт диалог между пользователем и ИИ ассистентом.\nПользователь и ассистент общаются на тему: {{seed}}\nРеплики человека начинаются с [Пользователь], реплики ассистента начинаются с [Ассистент].\nПользователь задаёт вопросы на основе темы и предыдущих сообщений.\nПользователь обрывает беседу, когда у него не остается вопросов.\nАссистент даёт максимально полные, информативные, точные и… See the full description on the dataset page: 
https://huggingface.co/datasets/IlyaGusev/ru_turbo_saiga.","downloads":582,"tags":["task_categories:text-generation","language:ru","license:cc-by-4.0","size_categories:10K<n<100K","region:us","chat"],"createdAt":"2023-04-08T20:53:59.000Z","key":""},{"_id":"6435bcb2ae81e0b346798438","id":"BHO/docs","author":"BHO","disabled":false,"gated":false,"lastModified":"2023-04-15T19:46:34.000Z","likes":1,"trendingScore":1,"private":false,"sha":"b46a56d44c9094b6e949160ed81334e39209eee9","downloads":136,"tags":["license:openrail","size_categories:n<1K","modality:document","library:datasets","library:mlcroissant","region:us"],"createdAt":"2023-04-11T20:01:54.000Z","key":""},{"_id":"6436e18f5f36a7ed0f470a3b","id":"HuggingFaceH4/databricks_dolly_15k","author":"HuggingFaceH4","disabled":false,"gated":false,"lastModified":"2023-04-12T17:11:41.000Z","likes":23,"trendingScore":1,"private":false,"sha":"222c42f7bef7c78770f538e8364a40b65aad2d2e","description":"\n\t\n\t\t\n\t\tDataset Card for Dolly_15K\n\t\n\n\n\t\n\t\t\n\t\tSummary\n\t\n\ndatabricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.\nThis dataset can be used for any purpose, whether academic or commercial,  under the terms of the Creative Commons… See the full description on the dataset page: 
https://huggingface.co/datasets/HuggingFaceH4/databricks_dolly_15k.","downloads":764,"tags":["license:cc-by-3.0","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2203.02155","region:us"],"createdAt":"2023-04-12T16:51:27.000Z","key":""},{"_id":"6436f81eb49082de74083f7a","id":"LEAP/ClimSim_high-res","author":"LEAP","disabled":false,"gated":false,"lastModified":"2023-09-29T20:30:24.000Z","likes":13,"trendingScore":1,"private":false,"sha":"f8dda6917eea8e8e05d7037cbc18c240aa9af2a9","description":"The corresponding GitHub repo can be found here:https://github.com/leap-stc/ClimSim\nRead more: https://arxiv.org/abs/2306.08754.\n","downloads":1049,"tags":["license:cc-by-4.0","arxiv:2306.08754","doi:10.57967/hf/0739","region:us"],"createdAt":"2023-04-12T18:27:42.000Z","key":""},{"_id":"643afc3b45200ac3e7068ed1","id":"nyuuzyou/AnimeHeadsv3","author":"nyuuzyou","disabled":false,"gated":false,"lastModified":"2023-07-02T23:24:38.000Z","likes":7,"trendingScore":1,"private":false,"sha":"88f79d16bd1bfa2705c1482be61c40662de00689","description":"\n\t\n\t\t\n\t\tAnimeHeadsv3 Object Detection Dataset\n\t\n\nThe AnimeHeadsv3 Object Detection Dataset is a collection of anime and art images, including manga pages, that have been annotated with object bounding boxes for use in object detection tasks.\n\n\t\n\t\t\n\t\tContents\n\t\n\nThere are two versions of the dataset available:\nThe dataset contains a total of 8157 images, split into training, validation, and testing sets. 
The images were collected from various sources and include a variety of anime and art… See the full description on the dataset page: https://huggingface.co/datasets/nyuuzyou/AnimeHeadsv3.","downloads":117,"tags":["task_categories:object-detection","license:wtfpl","size_categories:10K<n<100K","modality:image","modality:text","library:datasets","library:mlcroissant","region:us"],"createdAt":"2023-04-15T19:34:19.000Z","key":""},{"_id":"643ce713099590e9ed8f29f7","id":"togethercomputer/RedPajama-Data-1T","author":"togethercomputer","disabled":false,"gated":false,"lastModified":"2024-06-17T11:36:03.000Z","likes":1171,"trendingScore":1,"private":false,"sha":"398f92572e94f4793e41c22ab7ea2a788d9e7de4","description":"RedPajama is a clean-room, fully open-source implementation of the LLaMa dataset.","downloads":1974,"tags":["task_categories:text-generation","language:en","size_categories:1M<n<10M","modality:text","library:datasets","library:mlcroissant","region:us"],"createdAt":"2023-04-17T06:28:35.000Z","key":""},{"_id":"643e51b639345ed946ded395","id":"lksy/ru_instruct_gpt4","author":"lksy","disabled":false,"gated":false,"lastModified":"2023-06-02T16:56:03.000Z","likes":38,"trendingScore":1,"private":false,"sha":"5bc9536d264666378f4fb58a4bbd87b45221c9d5","description":"\n\t\n\t\t\n\t\tru_instruct_gpt4\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nDataset of GPT-4 generated instructions in Russian. 
Will soon be updated with more examples.\n\n\t\n\t\t\n\t\tLanguages\n\t\n\nRussian\n","downloads":156,"tags":["task_categories:text-generation","language:ru","license:cc-by-4.0","size_categories:10K<n<100K","region:us","chat"],"createdAt":"2023-04-18T08:15:50.000Z","key":""},{"_id":"64401c614164a65ca129bd34","id":"iamketan25/gsm-general-qa-instructions","author":"iamketan25","disabled":false,"gated":false,"lastModified":"2023-04-19T16:53:17.000Z","likes":5,"trendingScore":1,"private":false,"sha":"9e3e29afd6b7300703d5be723d03ad056b494ad6","downloads":78,"tags":["size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-04-19T16:52:49.000Z","key":""},{"_id":"644201ac55a16ae60fa855ad","id":"b-mc2/sql-create-context","author":"b-mc2","disabled":false,"gated":false,"lastModified":"2024-01-25T22:01:25.000Z","likes":501,"trendingScore":1,"private":false,"sha":"9d80a6a118b838d9defc3798d659a54a2ac2ff37","description":"\n\t\n\t\t\n\t\tOverview\n\t\n\nThis dataset builds from WikiSQL and Spider.\nThere are 78,577 examples of natural language queries, SQL CREATE TABLE statements, and SQL Query answering the question using the CREATE statement as context. This dataset was built with text-to-sql LLMs in mind, intending to prevent hallucination of column and table names often seen when trained on text-to-sql datasets. 
The CREATE TABLE statement can often be copy and pasted from different DBMS and provides table names, column… See the full description on the dataset page: https://huggingface.co/datasets/b-mc2/sql-create-context.","downloads":4148,"tags":["task_categories:text-generation","task_categories:question-answering","task_categories:table-question-answering","language:en","license:cc-by-4.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:1809.08887","region:us","SQL","code","NLP","text-to-sql","context-sql","spider","wikisql","sqlglot"],"createdAt":"2023-04-21T03:23:24.000Z","key":""},{"_id":"644391373dc283776326a854","id":"iamketan25/poem-instructions-dataset","author":"iamketan25","disabled":false,"gated":false,"lastModified":"2023-04-22T07:48:49.000Z","likes":4,"trendingScore":1,"private":false,"sha":"66cc52a68428447914ef7c236128f1dc0aaa1c1a","downloads":104,"tags":["size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-04-22T07:48:07.000Z","key":""},{"_id":"6444bb1e8f795c936d0cc3fc","id":"iamketan25/cmv-instructions","author":"iamketan25","disabled":false,"gated":false,"lastModified":"2023-04-23T04:59:33.000Z","likes":2,"trendingScore":1,"private":false,"sha":"221c9c83cbf51fa35411c050f9e0c81ad8fd11fc","downloads":73,"tags":["size_categories:n<1K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-04-23T04:59:10.000Z","key":""},{"_id":"6444bb538f795c936d0cc83d","id":"iamketan25/essay-instructions-dataset","author":"iamketan25","disabled":false,"gated":false,"lastModified":"2023-04-23T05:01:16.000Z","likes":4,"trendingScore":1,"private":false,"sha":"f3602f1db195042c3a25c12d6202f7bd775e1a33","downloads":81,"tags":["size_categories:1K<n<10K","format:parquet","modality:text","library:da
tasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-04-23T05:00:03.000Z","key":""},{"_id":"6444c13d3dc283776339c17b","id":"iamketan25/summarize-instructions-dataset","author":"iamketan25","disabled":false,"gated":false,"lastModified":"2023-04-23T05:26:26.000Z","likes":3,"trendingScore":1,"private":false,"sha":"2c675e4b82ed0c3a5f3b1552c2c9e6b99994c1ae","downloads":76,"tags":["size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-04-23T05:25:17.000Z","key":""},{"_id":"6444c2603dc283776339d436","id":"iamketan25/python-qa-instructions-dataset","author":"iamketan25","disabled":false,"gated":false,"lastModified":"2023-04-23T05:30:34.000Z","likes":16,"trendingScore":1,"private":false,"sha":"a79a22805cb4a428f0bd0ea7eee220e7bdee9b76","downloads":120,"tags":["size_categories:n<1K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-04-23T05:30:08.000Z","key":""},{"_id":"64470372f9dc06bea2aaec6f","id":"iamketan25/roleplay-instructions-dataset","author":"iamketan25","disabled":false,"gated":false,"lastModified":"2023-04-24T22:32:40.000Z","likes":33,"trendingScore":1,"private":false,"sha":"30118eb72dc4c04965e4af6cfc4c07bb3edef160","downloads":138,"tags":["size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-04-24T22:32:18.000Z","key":""},{"_id":"644810eb3411a0902bbe11f1","id":"iamketan25/open-assistant-instructions","author":"iamketan25","disabled":false,"gated":false,"lastModified":"2023-04-25T17:42:38.000Z","likes":8,"trendingScore":1,"private":false,"sha":"67712315bb0d41a124aff3259932d68d218e0bea","downloads":70,"tags":["size_categories:10K<n<100K","format:parquet","modality:tabular","modality:text","library:datasets","library:
pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-04-25T17:42:03.000Z","key":""},{"_id":"6448d53ef88f1495f0907f1c","id":"iamketan25/oig-instructions-dataset","author":"iamketan25","disabled":false,"gated":false,"lastModified":"2023-04-26T07:41:47.000Z","likes":2,"trendingScore":1,"private":false,"sha":"4512a39340b3166ecb21af0ef56a97335602b147","downloads":63,"tags":["size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-04-26T07:39:42.000Z","key":""},{"_id":"644b452caf97dfd24c1a555f","id":"SaiedAlshahrani/Egyptian_Arabic_Wikipedia_20230101","author":"SaiedAlshahrani","disabled":false,"gated":false,"lastModified":"2024-01-05T15:17:57.000Z","likes":6,"trendingScore":1,"private":false,"sha":"b867f3db7d6bbb8b8b0214774c2f5a34e7699c3b","description":"\n\t\n\t\t\n\t\tDataset Card for \"Egyptian_Arabic_Wikipedia_20230101\"\n\t\n\nThis dataset is created using the Egyptian Arabic Wikipedia articles, downloaded on the 1st of January 2023, processed using Gensim Python library, and preprocessed using tr Linux/Unix utility and CAMeLTools Python toolkit for Arabic NLP. 
This dataset was used to train this Egyptian Arabic Wikipedia Masked Language Model: SaiedAlshahrani/arzwiki_20230101_roberta_mlm.\nFor more details about the dataset, please read and cite our… See the full description on the dataset page: https://huggingface.co/datasets/SaiedAlshahrani/Egyptian_Arabic_Wikipedia_20230101.","downloads":37,"tags":["language:ar","license:mit","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-04-28T04:01:48.000Z","key":""},{"_id":"644cf8a20dc952d245989c08","id":"Maciel/FinCUGE-Instruction","author":"Maciel","disabled":false,"gated":false,"lastModified":"2023-08-20T02:26:39.000Z","likes":35,"trendingScore":1,"private":false,"sha":"e37b74d83c61ba3674b290c68c13e393aa6adbaf","description":"\n\t\n\t\t\n\t\tDataset Card for Dataset Name\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\n本数据集包含八项中文金融自然语言处理基准任务，分别为金融新闻摘要(FinNA)、金融新闻公告事件问答(FinQA)、金融新闻分类(FinNL)、金融新闻关系抽取(FinRE)、金融社交媒体文本情绪分类(FinNE)、金融负面消息及其主体判定(FinNSP)、金融因果事件抽取(FinCQA)、金融事件主体抽取(FinESE)。\n\n\n\t\n\t\t\n\t\tDataset Structure\n\t\n\n（1）FinNA\n金融新闻摘要数据集。输入一段金融新闻，需要模型生成一句话摘要。其中训练集包含24000条数据，验证集包含3000条数据。\n{\n  \"instruction\": \"根据以下新闻生成摘要。\",\n  \"input\":… See the full description on the dataset page: 
https://huggingface.co/datasets/Maciel/FinCUGE-Instruction.","downloads":246,"tags":["task_categories:question-answering","language:zh","license:apache-2.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","finance"],"createdAt":"2023-04-29T10:59:46.000Z","key":""},{"_id":"64507fa2577838187e093017","id":"kkcosmos/instagram-images-with-captions","author":"kkcosmos","disabled":false,"gated":false,"lastModified":"2023-05-04T18:23:34.000Z","likes":24,"trendingScore":1,"private":false,"sha":"9d7bd2f4cfc772b9a0fe7b80c6a528bfaf54ff7d","downloads":199,"tags":["license:unknown","size_categories:10K<n<100K","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-05-02T03:12:34.000Z","key":""},{"_id":"6451a2ee41f3c769b91b2685","id":"liuhaotian/LLaVA-Pretrain","author":"liuhaotian","disabled":false,"gated":false,"lastModified":"2023-07-06T08:47:38.000Z","likes":223,"trendingScore":1,"private":false,"sha":"70f9d1e5e1a697fe35830875cfc7de1dd590d727","description":"\n\t\n\t\t\n\t\tLLaVA Visual Instruct Pretrain Dataset Card\n\t\n\n\n\t\n\t\t\n\t\tDataset details\n\t\n\nDataset type:\nLLaVA Visual Instruct Pretrain LCS-558K is a subset of LAION/CC/SBU dataset, filtered with a more balanced concept coverage distribution.\nCaptions are also associated with BLIP synthetic caption for reference.\nIt is constructed for the pretraining stage for feature alignment in visual instruction tuning.\nWe aim to build large multimodal towards GPT-4 vision/language capability.\nDataset date:\nLLaVA… See the full description on the dataset page: 
https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain.","downloads":1527,"tags":["language:en","license:other","modality:image","region:us"],"createdAt":"2023-05-02T23:55:26.000Z","key":""},{"_id":"645213f394a54195ce4f74d2","id":"corvj/daps","author":"corvj","disabled":false,"gated":false,"lastModified":"2023-05-03T12:52:18.000Z","likes":3,"trendingScore":1,"private":false,"sha":"83b473866ae13c4cbf121dd5bdb150e314499653","citation":"@article{mysore2014can,\n  title={Can we automatically transform speech recorded on common consumer devices in real-world environments into professional production quality speech?—a dataset, insights, and challenges},\n  author={Mysore, Gautham J},\n  journal={IEEE Signal Processing Letters},\n  volume={22},\n  number={8},\n  pages={1006--1010},\n  year={2014},\n  publisher={IEEE}\n}","description":"The DAPS (Device and Produced Speech) dataset is a collection of aligned versions of professionally produced studio speech recordings and recordings of the same speech on common consumer devices (tablet and smartphone) in real-world environments. It has 15 versions of audio (3 professional versions and 12 consumer device/real-world environment combinations). 
Each version consists of about 4 1/2 hours of data (about 14 minutes from each of 20 speakers).","downloads":18,"tags":["language:en","region:us"],"createdAt":"2023-05-03T07:57:39.000Z","key":""},{"_id":"64579693711ee86f6eee4364","id":"phongmt184172/mtet","author":"phongmt184172","disabled":false,"gated":false,"lastModified":"2023-05-08T07:41:53.000Z","likes":11,"trendingScore":1,"private":false,"sha":"f1ada3b8da8a6c8ea13d60c0df198a0037074357","description":"load_dataset('phongmt184172/mtet')\nThe dataset is cloned https://github.com/vietai/mTet for machine translation task.\n","downloads":202,"tags":["task_categories:translation","language:en","language:vi","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-05-07T12:16:19.000Z","key":""},{"_id":"6458b1ecf92601affa29b4b8","id":"thu-coai/kdconv","author":"thu-coai","disabled":false,"gated":false,"lastModified":"2023-05-08T10:39:46.000Z","likes":5,"trendingScore":1,"private":false,"sha":"460c94a39c1498241b3c7e94a22be25e1489601e","description":"The KDConv dataset. GitHub repo. 
Original paper.\n@inproceedings{zhou-etal-2020-kdconv,\n    title = \"{K}d{C}onv: A {C}hinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven Conversation\",\n    author = \"Zhou, Hao  and\n      Zheng, Chujie  and\n      Huang, Kaili  and\n      Huang, Minlie  and\n      Zhu, Xiaoyan\",\n    booktitle = \"ACL\",\n    year = \"2020\"\n}\n\n","downloads":1556,"tags":["language:zh","license:apache-2.0","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:dask","library:mlcroissant","arxiv:2004.04100","region:us"],"createdAt":"2023-05-08T08:25:16.000Z","key":""},{"_id":"645d03ed8ce4443cae709b55","id":"koutch/falcon_code","author":"koutch","disabled":false,"gated":"manual","lastModified":"2025-08-12T17:36:47.000Z","likes":11,"trendingScore":1,"private":false,"sha":"98692ffe504130851ee9c020937f4fa1f9f28cb9","description":"\n\t\n\t\t\n\t\tFalconCode\n\t\n\nFalconCode is a large-scale dataset of student programming solutions, collected from multiple introductory programming courses.It is designed for research on automatic programming feedback, code understanding, and educational AI.The dataset has been curated and processed for SIGCSE 2024 and is described in detail in the FalconCode project page.\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Summary\n\t\n\nFalconCode contains programming assignments, student submissions, execution results, and rich… See the full description on the dataset page: 
https://huggingface.co/datasets/koutch/falcon_code.","downloads":65,"tags":["task_categories:text-generation","language:en","size_categories:1M<n<10M","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","code"],"createdAt":"2023-05-11T15:04:13.000Z","key":""},{"_id":"645e3fcc43abb116540da202","id":"KakologArchives/KakologArchives","author":"KakologArchives","disabled":false,"gated":false,"lastModified":"2026-06-30T08:43:24.000Z","likes":58,"trendingScore":1,"private":false,"sha":"2bded9c63310c154a20d8414c4926a21cd769bfc","description":"\n\t\n\t\t\n\t\n\t\n\t\tニコニコ実況 過去ログアーカイブ\n\t\n\nニコニコ実況 過去ログアーカイブは、ニコニコ実況 のサービス開始から現在までのすべての過去ログコメントを収集したデータセットです。\n去る2020年12月、ニコニコ実況は ニコニコ生放送内の一公式チャンネルとしてリニューアル されました。これに伴い、2009年11月から運用されてきた旧システムは提供終了となり（事実上のサービス終了）、torne や BRAVIA などの家電への対応が軒並み終了する中、当時の生の声が詰まった約11年分の過去ログも同時に失われることとなってしまいました。\nそこで 5ch の DTV 板の住民が中心となり、旧ニコニコ実況が終了するまでに11年分の全チャンネルの過去ログをアーカイブする計画が立ち上がりました。紆余曲折あり Nekopanda 氏が約11年分のラジオや BS も含めた全チャンネルの過去ログを完璧に取得してくださったおかげで、11年分の過去ログが電子の海に消えていく事態は回避できました。しかし、旧 API が廃止されてしまったため過去ログを API… See the full description on the dataset page: https://huggingface.co/datasets/KakologArchives/KakologArchives.","downloads":2021416,"tags":["task_categories:text-classification","language:ja","license:mit","region:us"],"createdAt":"2023-05-12T13:31:56.000Z","key":""},{"_id":"6462e0c0cce92c7d883113f5","id":"ceval/ceval-exam","author":"ceval","disabled":false,"gated":false,"lastModified":"2025-07-27T03:59:42.000Z","likes":302,"trendingScore":1,"private":false,"sha":"617524a00b307ff6f9933702f724131fe12ca7ce","description":"C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13948 multi-choice questions spanning 52 diverse disciplines and four difficulty levels. Please visit our website and GitHub or check our paper for more details.\nEach subject consists of three splits: dev, val, and test.  
The dev set per subject consists of five exemplars with explanations for few-shot evaluation. The val set is intended to be used for hyperparameter tuning. And the test set is for model… See the full description on the dataset page: https://huggingface.co/datasets/ceval/ceval-exam.","downloads":37788,"tags":["task_categories:text-classification","task_categories:multiple-choice","task_categories:question-answering","language:zh","license:cc-by-nc-sa-4.0","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2305.08322","region:us"],"createdAt":"2023-05-16T01:47:44.000Z","key":""},{"_id":"64636eaf95e64061c97b79ec","id":"J-Mourad/MNAD.v2","author":"J-Mourad","disabled":false,"gated":false,"lastModified":"2023-05-16T12:22:21.000Z","likes":1,"trendingScore":1,"private":false,"sha":"b6ab21aa6fd4b466b88d6de69a86f94da9ef0101","description":"\n\t\n\t\t\n\t\tAbout the MNAD Dataset\n\t\n\nThe MNAD corpus is a collection of over 1 million Moroccan news articles written in the modern Arabic language. These news articles have been gathered from 11 prominent electronic news sources. 
The dataset is made available to the academic community for research purposes, such as data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), and other non-commercial activities.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Fields\n\t\n\n\nTitle: The title of… See the full description on the dataset page: https://huggingface.co/datasets/J-Mourad/MNAD.v2.","downloads":15,"tags":["size_categories:1M<n<10M","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-05-16T11:53:19.000Z","key":""},{"_id":"64683196e92e2372d5d94002","id":"linhtran92/viet_youtube_asr_corpus_v2","author":"linhtran92","disabled":false,"gated":false,"lastModified":"2023-05-20T03:34:49.000Z","likes":7,"trendingScore":1,"private":false,"sha":"9ff8558f64727c5d42cb641458928cd8e2c3efdf","description":"\n\t\n\t\t\n\t\tDataset Card for \"viet_youtube_asr_corpus_v2\"\n\t\n\nMore Information needed\n","downloads":206,"tags":["size_categories:100K<n<1M","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-05-20T02:33:58.000Z","key":""},{"_id":"646f1cefe2a72c647b6328fa","id":"Linly-AI/Chinese-pretraining-dataset","author":"Linly-AI","disabled":false,"gated":false,"lastModified":"2023-05-26T02:32:06.000Z","likes":44,"trendingScore":1,"private":false,"sha":"c4ee6b9ab78909729b7d00c5de5299f4662e5a64","description":"Data source: 
https://github.com/CVI-SZU/Linly/wiki/Linly-OpenLLaMA\n","downloads":470,"tags":["license:apache-2.0","size_categories:10M<n<100M","format:json","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-05-25T08:31:43.000Z","key":""},{"_id":"646f6473fa9253f3ae8259c6","id":"TeamSODA/mcl-signal_processing_attacks_whisper_librispeech","author":"TeamSODA","disabled":false,"gated":false,"lastModified":"2023-05-25T14:13:03.000Z","likes":1,"trendingScore":1,"private":false,"sha":"889d2e9751c8ff326fb01f5ed84deb910d376e45","description":"\n\t\n\t\t\n\t\tDataset Card for \"mcl-signal_processing_attacks_large\"\n\t\n\nMore Information needed\n","downloads":110,"tags":["size_categories:1K<n<10K","format:parquet","modality:audio","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-05-25T13:36:51.000Z","key":""},{"_id":"647051609fe78d69a8a9f045","id":"beomi/KoAlpaca-v1.1a","author":"beomi","disabled":false,"gated":false,"lastModified":"2023-05-26T06:32:02.000Z","likes":63,"trendingScore":1,"private":false,"sha":"03d3122c530b1e47195c08a3d851eeadddad9689","description":"\n\t\n\t\t\n\t\tDataset Card for \"KoAlpaca-v1.1a\"\n\t\n\n\n\t\n\t\t\n\t\tProject Repo\n\t\n\n\nGithub Repo: Beomi/KoAlpaca\n\n\n\t\n\t\t\n\t\tHow to use\n\t\n\n>>> from datasets import load_dataset\n\n>>> ds = load_dataset(\"beomi/KoAlpaca-v1.1a\", split=\"train\")\n>>> ds\nDataset({\n    features: ['instruction', 'input', 'output'],\n    num_rows: 21155\n})\n\n>>> ds[0]\n{'instruction': '양파는 어떤 식물 부위인가요? 그리고 고구마는 뿌리인가요?',\n 'output': '양파는 잎이 아닌 식물의 줄기 부분입니다. 고구마는 식물의 뿌리 부분입니다. \\n\\n식물의 부위의 구분에 대해 궁금해하는 분이라면 분명 이 질문에 대한 답을 찾고 있을 것입니다. 
양파는 잎이 아닌 줄기… See the full description on the dataset page: https://huggingface.co/datasets/beomi/KoAlpaca-v1.1a.","downloads":735,"tags":["task_categories:text-generation","language:ko","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","KoAlpaca"],"createdAt":"2023-05-26T06:27:44.000Z","key":""},{"_id":"6472f0626cff2f8672fbab98","id":"linhtran92/viet_bud500","author":"linhtran92","disabled":false,"gated":"auto","lastModified":"2024-02-29T10:21:53.000Z","likes":70,"trendingScore":1,"private":false,"sha":"4b30571f395781fddc3a4946fb378648f8572714","description":"\n\t\n\t\t\n\t\tBud500: A Comprehensive Vietnamese ASR Dataset\n\t\n\nIntroducing Bud500, a diverse Vietnamese speech corpus designed to support ASR research community. With aprroximately 500 hours of audio, it covers a broad spectrum of topics including podcast, travel, book, food, and so on, while spanning accents from Vietnam's North, South, and Central regions. Derived from free public audio resources, this publicly accessible dataset is designed to significantly enhance the work of developers and… See the full description on the dataset page: https://huggingface.co/datasets/linhtran92/viet_bud500.","downloads":907,"tags":["task_categories:automatic-speech-recognition","multilinguality:monolingual","language:vi","license:cc-by-nc-sa-4.0","size_categories:100K<n<1M","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-05-28T06:10:42.000Z","key":""},{"_id":"64744452f9e3e0b312e56911","id":"gorilla-llm/APIBench","author":"gorilla-llm","disabled":false,"gated":false,"lastModified":"2023-05-29T06:31:49.000Z","likes":75,"trendingScore":1,"private":false,"sha":"ac21e1892e634dfa25f8ad75f16cbdbfb0a5736d","description":"\n\t\n\t\t\n\t\tGorilla: Large Language Model Connected with Massive APIs\n\t\n\nBy Shishir G. 
Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez  (Project Website)\n   \nGorilla enables LLMs to use tools by invoking APIs. Given a natural language query, Gorilla can write a semantically- and syntactically- correct API to invoke. With Gorilla, we are the first to demonstrate how to use LLMs to invoke 1,600+ (and growing) API calls accurately while reducing hallucination. We also release APIBench, the… See the full description on the dataset page: https://huggingface.co/datasets/gorilla-llm/APIBench.","downloads":932,"tags":["language:en","license:apache-2.0","arxiv:2305.15334","region:us","api"],"createdAt":"2023-05-29T06:21:06.000Z","key":""},{"_id":"6475b941ab17a37d0b18802b","id":"UniqueData/high_quality_webcam_video_attacks","author":"UniqueData","disabled":false,"gated":false,"lastModified":"2025-10-10T13:39:04.000Z","likes":4,"trendingScore":1,"private":false,"sha":"839928fdaa51f17ea2d6ee2c38036df7476f598d","citation":"@InProceedings{huggingface:dataset,\ntitle = {high_quality_webcam_video_attacks},\nauthor = {TrainingDataPro},\nyear = {2023}\n}","description":"The dataset includes live-recorded Anti-Spoofing videos from around the world,\ncaptured via **high-quality** webcams with Full HD resolution and above.","downloads":59,"tags":["task_categories:video-classification","language:en","license:cc-by-nc-nd-4.0","size_categories:10K<n<100K","modality:video","region:us","ibeta","replay attack","video","liveness detection","biometric","anti-spoofing"],"createdAt":"2023-05-30T08:52:17.000Z","key":""},{"_id":"64764cb157108da176fd10ad","id":"argilla/dolly-curated-comparison-falcon-7b-instruct","author":"argilla","disabled":false,"gated":false,"lastModified":"2023-07-13T11:28:57.000Z","likes":6,"trendingScore":1,"private":false,"sha":"2f83fb2d8dbc39cc45d702ba94a1f8b391aa82ca","description":"\n\t\n\t\t\n\t\tDataset Card for \"dolly-curated-comparison-falcon-7b-instruct\"\n\t\n\nThis dataset contains two generated responses using the falcon-7b-instruct model 
and the original, curated, prompt + responses from the Dolly v2 curated dataset. For now only 50% of the original dataset is available but we plan to complete it.\nThis dataset can be used for training a reward model for RLHF using Argilla Feedback\nMore Information needed\n","downloads":40,"tags":["language:en","size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-05-30T19:21:21.000Z","key":""},{"_id":"647657ff6f0553f1b4f6b6e1","id":"talmp/en-vi-translation","author":"talmp","disabled":false,"gated":false,"lastModified":"2023-05-31T22:45:58.000Z","likes":11,"trendingScore":1,"private":false,"sha":"5f0e707f726348022e6f14b7aa1f39ab0c562284","description":"\n\t\n\t\t\n\t\tTo join all training set files together\n\t\n\nrun python join_dataset.py file, final result will be join_dataset.json file\n","downloads":99,"tags":["task_categories:translation","language:en","language:vi","license:wtfpl","size_categories:1M<n<10M","format:json","modality:text","library:datasets","library:dask","library:mlcroissant","region:us"],"createdAt":"2023-05-30T20:09:35.000Z","key":""},{"_id":"6477997fbb7681ad670b8898","id":"yuan-yang/MALLS-v0","author":"yuan-yang","disabled":false,"gated":false,"lastModified":"2023-10-25T20:16:00.000Z","likes":18,"trendingScore":1,"private":false,"sha":"d3cb90d034130c1269f4f49515d235e03271127e","description":"\n\t\n\t\t\n\t\tMALLS NL-FOL Pairs\n\t\n\n\n\t\n\t\t\n\t\tDataset details\n\t\n\nMALLS (large language Model generAted natural-Language-to-first-order-Logic pairS) \nconsists of pairs of real-world natural language (NL) statements and the corresponding first-order logic (FOL) rules annotations.\nAll pairs are generated by prompting GPT-4 and processed to ensure the validity of the FOL rules.\nMALLS-v0 consists of the original 34K NL-FOL pairs. 
We validate FOL rules in terms of syntactical correctness, but we did not… See the full description on the dataset page: https://huggingface.co/datasets/yuan-yang/MALLS-v0.","downloads":88,"tags":["task_categories:text-generation","language:en","license:cc-by-nc-4.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2305.15541","region:us"],"createdAt":"2023-05-31T19:01:19.000Z","key":""},{"_id":"647cdf261c0644de8d2fc307","id":"hssd/hssd-hab","author":"hssd","disabled":false,"gated":false,"lastModified":"2025-02-14T02:19:58.000Z","likes":45,"trendingScore":1,"private":false,"sha":"4369cb9876214c7fbebcf552eb532380e4d287e4","description":"\n\t\n\t\t\n\t\tHSSD: Habitat Synthetic Scenes Dataset\n\t\n\nThe Habitat Synthetic Scenes Dataset (HSSD) is a human-authored 3D scene dataset that more closely mirrors real scenes than prior datasets.\nOur dataset represents real interiors and contains a diverse set of 211 scenes and more than 18000 models of real-world objects.\n\n\nThis repository provides a Habitat consumption-ready compressed version of HSSD.\nSee this repository for corresponding uncompressed assets.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Structure\n\t\n\n├──… See the full description on the dataset page: https://huggingface.co/datasets/hssd/hssd-hab.","downloads":18910,"tags":["language:en","license:cc-by-nc-4.0","region:us","3D scenes","Embodied AI"],"createdAt":"2023-06-04T18:59:50.000Z","key":""},{"_id":"647d8a9532c471a7fa7e122b","id":"kaist-ai/CoT-Collection","author":"kaist-ai","disabled":false,"gated":false,"lastModified":"2023-10-14T12:10:16.000Z","likes":163,"trendingScore":1,"private":false,"sha":"c9d352cdc119df4a4f7526d100e4acb4a72a7a5c","citation":"@article{kim2023cot,\n  title={The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning},\n  author={Kim, Seungone and Joo, Se June and Kim, Doyoung and Jang, Joel 
and Ye, Seonghyeon and Shin, Jamin and Seo, Minjoon},\n  journal={arXiv preprint arXiv:2305.14045},\n  year={2023}\n}","description":"\"\"\"\n\n_LICENSE = \"CC BY 4.0\"\n\n_HOMEPAGE = \"https://github.com/kaistAI/CoT-Collection\"\n\n_LANGUAGES = {\n    \"en\": \"English\",\n}\n# _ALL_LANGUAGES = \"all_languages\"\n\n\n\nclass CoTCollectionMultiConfig(datasets.BuilderConfig):","downloads":1550,"tags":["task_categories:text-generation","task_categories:text-classification","language:en","license:cc-by-4.0","size_categories:1M<n<10M","modality:text","library:datasets","library:mlcroissant","arxiv:2305.14045","region:us"],"createdAt":"2023-06-05T07:11:17.000Z","key":""},{"_id":"6480129486888bbffbe7ae53","id":"GAIR/lima","author":"GAIR","disabled":false,"gated":"auto","lastModified":"2023-06-08T02:40:19.000Z","likes":466,"trendingScore":1,"private":false,"sha":"68958e98267f5fb4a52a03ebcdae4ae59213fa7c","description":"A high-quality dataset for efficient instruction tuning.","downloads":1932,"tags":["license:other","size_categories:1K<n<10K","modality:text","library:datasets","library:mlcroissant","arxiv:2305.11206","region:us"],"createdAt":"2023-06-07T05:16:04.000Z","key":""},{"_id":"648430f4b8614941d652b18b","id":"cnut1648/instruction-attack-data","author":"cnut1648","disabled":false,"gated":false,"lastModified":"2023-06-10T08:19:53.000Z","likes":1,"trendingScore":1,"private":false,"sha":"93a6b36bbe3ed632c4b74e33fb41bb51c2d6a915","downloads":170,"tags":["region:us"],"createdAt":"2023-06-10T08:14:44.000Z","key":""},{"_id":"64846ae5b8614941d6625be8","id":"SohamGhadge/casual-conversation","author":"SohamGhadge","disabled":false,"gated":false,"lastModified":"2023-06-10T12:22:43.000Z","likes":34,"trendingScore":1,"private":false,"sha":"fee7b8e4a8fb991c7450d293e27b6f774505eba6","downloads":345,"tags":["size_categories:1K<n<10K","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-06-10T12:21:57.
000Z","key":""},{"_id":"6486b41971c8767fad774177","id":"mattymchen/refinedweb-3m","author":"mattymchen","disabled":false,"gated":false,"lastModified":"2023-06-12T06:01:04.000Z","likes":10,"trendingScore":1,"private":false,"sha":"21ba1961ddd6c634ac37265ea17a877d2723af5e","description":"\n\t\n\t\t\n\t\tDataset Card for \"refinedweb-3m\"\n\t\n\nMore Information needed\n","downloads":352,"tags":["size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-06-12T05:58:49.000Z","key":""},{"_id":"64897793837ad032c6c25d5b","id":"agkphysics/AudioSet","author":"agkphysics","disabled":false,"gated":false,"lastModified":"2025-10-16T11:21:24.000Z","likes":99,"trendingScore":1,"private":false,"sha":"0c609e8302cf139307f639c57652032af0a88041","description":"\n\t\n\t\t\n\t\tDataset Card for AudioSet\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nAudioSet is a dataset of 10-second clips from YouTube, annotated into one or more sound categories, following the AudioSet ontology.\n\n\t\n\t\t\n\t\tSupported Tasks and Leaderboards\n\t\n\n\naudio-classification: Classify audio clips into categories. 
The leaderboard is available here\n\n\n\t\n\t\t\n\t\tLanguages\n\t\n\nThe class labels in the dataset are in English.\n\n\t\n\t\t\n\t\tDataset Structure\n\t\n\n\n\t\n\t\t\n\t\tData Instances\n\t\n\nExample instance from the dataset:\n{… See the full description on the dataset page: https://huggingface.co/datasets/agkphysics/AudioSet.","downloads":27556,"paperswithcode_id":"audioset","tags":["task_categories:audio-classification","source_datasets:original","language:en","license:cc-by-4.0","size_categories:1M<n<10M","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","audio"],"createdAt":"2023-06-14T08:17:23.000Z","key":""},{"_id":"6489e4d190497fd1a77dc277","id":"PKU-Alignment/PKU-SafeRLHF","author":"PKU-Alignment","disabled":false,"gated":false,"lastModified":"2024-10-18T03:47:09.000Z","likes":191,"trendingScore":1,"private":false,"sha":"9421ffafec3fa40a1f1a7d567b4d525079477ecb","description":"\n\t\n\t\t\n\t\tDataset Card for PKU-SafeRLHF\n\t\n\nWarning: this dataset contains data that may be offensive or harmful. The data are intended for research purposes, especially research that can make models less harmful. The views expressed in the data do not reflect the views of PKU-Alignment Team or any of its members. 
\n[🏠 Homepage] [🤗 Single Dimension Preference Dataset] [🤗 Q-A Dataset] [🤗 Prompt Dataset]\n\n\t\n\t\n\t\n\t\tCitation\n\t\n\nIf PKU-SafeRLHF has contributed to your work, please consider citing… See the full description on the dataset page: https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF.","downloads":9989,"tags":["task_categories:text-generation","language:en","license:cc-by-nc-4.0","size_categories:100K<n<1M","format:json","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2406.15513","region:us","safe","safety","ai-safety","llm","lm","human-feedback","rlhf","safe-rlhf"],"createdAt":"2023-06-14T16:03:29.000Z","key":""},{"_id":"648b1ab9164d41f3991fa1cc","id":"WizardLMTeam/WizardLM_evol_instruct_V2_196k","author":"WizardLMTeam","disabled":false,"gated":false,"lastModified":"2024-03-10T01:06:00.000Z","likes":250,"trendingScore":1,"private":false,"sha":"8a7d15a83028b5c93915677704f492839e2675f6","description":"\n\t\n\t\t\n\t\tNews\n\t\n\n\n🔥 🔥 🔥 [08/11/2023] We release WizardMath Models.\n🔥 Our WizardMath-70B-V1.0 model slightly outperforms some closed-source LLMs on the GSM8K, including ChatGPT 3.5, Claude Instant 1 and PaLM 2 540B.\n🔥 Our WizardMath-70B-V1.0 model achieves  81.6 pass@1 on the GSM8k Benchmarks, which is 24.8 points higher than the SOTA open-source LLM.\n🔥 Our WizardMath-70B-V1.0 model achieves  22.7 pass@1 on the MATH Benchmarks, which is 9.2 points higher than the SOTA open-source LLM.… See the full description on the dataset page: 
https://huggingface.co/datasets/WizardLMTeam/WizardLM_evol_instruct_V2_196k.","downloads":2197,"tags":["license:mit","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2308.09583","arxiv:2304.12244","arxiv:2306.08568","region:us"],"createdAt":"2023-06-15T14:05:45.000Z","key":""},{"_id":"648b583c412f81cc0c1dda27","id":"ai-habitat/habitat_test_scenes","author":"ai-habitat","disabled":false,"gated":false,"lastModified":"2023-09-20T18:49:34.000Z","likes":6,"trendingScore":1,"private":false,"sha":"910c783fb954da8497ea5f811b843a76590ddddc","description":"\n\t\n\t\t\n\t\tHabitat Test Scenes Dataset\n\t\n\nA few lightweight static .glb stages for testing habitat-sim and habitat-lab installation and CI without other datasets.\n\n\t\n\t\t\n\t\tContents:\n\t\n\nskokloster-castle.glb - Scan from Sketchfab\napartment_0.glb - Scan from Replica Dataset (geometry decimated for simulation and memory efficiency)\nvan-gogh-room.glb - Synthetic Asset from Sketchfab\n.navmesh files for simulated agent navigation constraints in Habitat-sim.\n","downloads":1637,"tags":["license:cc-by-nc-4.0","region:us"],"createdAt":"2023-06-15T18:28:12.000Z","key":""},{"_id":"648c6967b010e9fed5fe3137","id":"ParsiAI/FarsTail","author":"ParsiAI","disabled":false,"gated":false,"lastModified":"2025-03-19T08:59:34.000Z","likes":2,"trendingScore":1,"private":false,"sha":"bc924588a4f73c4d3b590ba9d603584f074b848d","citation":"\\@article{amirkhani2020farstail,\n  title={FarsTail: A Persian Natural Language Inference Dataset},\n  author={Hossein Amirkhani, Mohammad Azari Jafari, Azadeh Amirak, Zohreh Pourjafari, Soroush Faridan Jahromi, and Zeinab Kouhkan},\n  journal={arXiv preprint arXiv:2009.08820},\n  year={2020}\n}","description":"\\\\\\\\\\\\\\A Persian Natural Language Inference 
Dataset","downloads":23,"tags":["task_categories:text-classification","language:fa","license:apache-2.0","size_categories:1K<n<10K","arxiv:2009.08820","region:us"],"createdAt":"2023-06-16T13:53:43.000Z","key":""},{"_id":"649431d34b687633b5677c76","id":"MikhailT/cmu-arctic","author":"MikhailT","disabled":false,"gated":false,"lastModified":"2023-06-23T09:07:03.000Z","likes":4,"trendingScore":1,"private":false,"sha":"b88ad416e092c4dfdffc6071f4b8e2d057a160f9","description":"\n\t\n\t\t\n\t\tCMU Arctic Dataset\n\t\n\n","downloads":391,"tags":["language:en","license:mit","size_categories:10K<n<100K","format:parquet","modality:audio","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-06-22T11:34:43.000Z","key":""},{"_id":"649444227853dd12c3bbadd8","id":"Amod/mental_health_counseling_conversations","author":"Amod","disabled":false,"gated":false,"lastModified":"2025-11-25T20:32:10.000Z","likes":487,"trendingScore":1,"private":false,"sha":"d7e86f0813c5690181b41f97403c3674aa55dcef","description":"\n\n\t\n\t\t\n\t\tAmod/mental_health_counseling_conversations\n\t\n\nThis dataset is a compilation of high-quality, real one-on-one mental health counseling conversations between individuals and licensed professionals. 
Each exchange is structured as a clear question–answer pair, making it directly suitable for fine-tuning or instruction-tuning language models that need to handle sensitive, empathetic, and contextually aware dialogue.\nSince its public release in 2023, it has been downloaded over 100,000… See the full description on the dataset page: https://huggingface.co/datasets/Amod/mental_health_counseling_conversations.","downloads":3109,"tags":["task_categories:text-generation","task_categories:question-answering","language:en","license:other","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","doi:10.57967/hf/1581","region:us","medical"],"createdAt":"2023-06-22T12:52:50.000Z","key":""},{"_id":"6494788c19ec30b47d89106e","id":"QuixiAI/open-instruct-uncensored","author":"QuixiAI","disabled":false,"gated":false,"lastModified":"2023-06-22T18:41:10.000Z","likes":59,"trendingScore":1,"private":false,"sha":"7d2c87d9e37e10cdcac7e3a9064eef3021b8ba30","description":"This is Allen AI's open-instruct dataset.\nIt is used to train the Tulu family of models.\n\nhttps://huggingface.co/allenai/tulu-7b\nhttps://huggingface.co/allenai/tulu-13b\nhttps://huggingface.co/allenai/tulu-30b\nhttps://huggingface.co/allenai/tulu-65b\n\nI have done the following:\n\nDownload the open-instruct repo\nExecute the scripts/prepare_train_data.sh modified to download the \"unfiltered\" version of sharegpt dataset\nMerged data/processed/**/*.jsonl into a single \"open-instruct.jsonl\"\nExecuted my… See the full description on the dataset page: 
https://huggingface.co/datasets/QuixiAI/open-instruct-uncensored.","downloads":451,"tags":["license:apache-2.0","size_categories:1M<n<10M","format:json","modality:text","library:datasets","library:dask","library:mlcroissant","region:us"],"createdAt":"2023-06-22T16:36:28.000Z","key":""},{"_id":"6494b9e097d66dee803d18be","id":"Romjiik/Russian_bank_reviews","author":"Romjiik","disabled":false,"gated":false,"lastModified":"2023-06-22T21:29:37.000Z","likes":6,"trendingScore":1,"private":false,"sha":"2cbd6c0c1dea3e3ec98bb028f141ee5a4684dfa8","description":"\n\t\n\t\t\n\t\tDataset Card for bank reviews dataset\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThe dataset is collected from the banki.ru website.\nIt contains customer reviews of various banks. In total, the dataset contains 12399 reviews. \nThe dataset is suitable for sentiment classification.\nThe dataset contains this fields - bank name, username, review title, review text, review time, number of views,\nnumber of comments, review rating set by the user, as well as ratings for special categories… See the full description on the dataset page: https://huggingface.co/datasets/Romjiik/Russian_bank_reviews.","downloads":151,"tags":["task_categories:text-classification","language:ru","size_categories:10K<n<100K","format:csv","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","finance"],"createdAt":"2023-06-22T21:15:12.000Z","key":""},{"_id":"649949a43756a91498f821c0","id":"FreedomIntelligence/alpaca-gpt4-deutsch","author":"FreedomIntelligence","disabled":false,"gated":false,"lastModified":"2023-08-06T08:08:37.000Z","likes":9,"trendingScore":1,"private":false,"sha":"d708b3249a6434ef0329be854f007e5a133d4cfa","description":"The dataset is used in the research related to MultilingualSIFT. 
\n","downloads":284,"tags":["license:apache-2.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-06-26T08:17:40.000Z","key":""},{"_id":"64999330ed09df36eaf5e8b0","id":"globis-university/aozorabunko-clean","author":"globis-university","disabled":false,"gated":false,"lastModified":"2023-10-27T13:22:32.000Z","likes":46,"trendingScore":1,"private":false,"sha":"42a9c9c0f1d67e6a5554d9bea4201973dc9b049c","description":"\n\t\n\t\t\n\t\tOverview\n\t\n\nThis dataset provides a convenient and user-friendly format of data from Aozora Bunko (青空文庫), a website that compiles public-domain books in Japan, ideal for Machine Learning applications.\n[For Japanese] 日本語での概要説明を Qiita に記載しました: https://qiita.com/akeyhero/items/b53eae1c0bc4d54e321f\n\n\t\n\t\t\n\t\tMethodology\n\t\n\nThe code to reproduce this dataset is made available on GitHub: globis-org/aozorabunko-exctractor.\n\n\t\n\t\t\n\t\t1. 
Data collection\n\t\n\nWe firstly downloaded the CSV file that… See the full description on the dataset page: https://huggingface.co/datasets/globis-university/aozorabunko-clean.","downloads":2705,"tags":["task_categories:text-generation","task_categories:text-classification","language:ja","license:cc-by-4.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-06-26T13:31:28.000Z","key":""},{"_id":"6499abce08e9b100b1dc0795","id":"hssd/hssd-scenes","author":"hssd","disabled":false,"gated":false,"lastModified":"2023-06-26T15:31:31.000Z","likes":10,"trendingScore":1,"private":false,"sha":"05778c2a54a97a827e13b662e48fa6a509d7af00","description":"\n\t\n\t\t\n\t\tHSSD: Habitat Synthetic Scenes Dataset\n\t\n\nThe Habitat Synthetic Scenes Dataset (HSSD) is a human-authored 3D scene dataset that more closely mirrors real scenes than prior datasets.\nOur dataset represents real interiors and contains a diverse set of 211 scenes and more than 18000 models of real-world objects.\n\n","downloads":424,"tags":["language:en","license:cc-by-nc-4.0","region:us","3D scenes","Embodied AI"],"createdAt":"2023-06-26T15:16:30.000Z","key":""},{"_id":"649d0e78722486c3ce7bf0a7","id":"allenai/peS2o","author":"allenai","disabled":false,"gated":false,"lastModified":"2024-10-13T02:53:05.000Z","likes":197,"trendingScore":1,"private":false,"sha":"636a503e44a3ca1b58e01fb61eab0825cd574de0","citation":"@techreport{peS2o,\n    author = {Luca Soldaini and Kyle Lo},\n    year = 2023,\n    title = {{peS2o (Pretraining Efficiently on S2ORC) Dataset}},\n    institution = {{Allen Institute for AI}},\n    note = {ODC-By, \\\\url{https://github.com/allenai/pes2o}}\n}","description":"\n  \n\nPretraining Effectively on S2ORC!\n\nThe peS2o dataset is a collection of ~40M creative open-access academic papers,\ncleaned, filtered, and formatted for pre-training of language models. 
It is derived from\nthe Semantic Scholar Open Research Corpus(Lo et al, 2020), or S2ORC.\nWe release multiple version of peS2o, each with different processing and knowledge cutoff\ndate. We recommend you to use the latest version available.\nIf you use this dataset, please cite:\n@techreport{peS2o,\n    author =… See the full description on the dataset page: https://huggingface.co/datasets/allenai/peS2o.","downloads":11874,"tags":["task_categories:text-generation","task_categories:fill-mask","source_datasets:allenai/s2orc","language:en","license:odc-by","size_categories:10B<n<100B","region:us","biology","chemistry","engineering","computer science","physics","material science","math","psychology","economics","political science","business","geology","sociology","geography","environmental science","art","history","philosophy"],"createdAt":"2023-06-29T04:54:16.000Z","key":""},{"_id":"649e4f4c53777d895072cd81","id":"FreedomIntelligence/evol-instruct-deutsch","author":"FreedomIntelligence","disabled":false,"gated":false,"lastModified":"2023-08-06T08:12:07.000Z","likes":13,"trendingScore":1,"private":false,"sha":"83bb19029ce76b073b9d9ab77e65a41d2bb93e9b","description":"The dataset is used in the research related to MultilingualSIFT. 
\n","downloads":126,"tags":["size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-06-30T03:43:08.000Z","key":""},{"_id":"64a423b8f0748b5b615fd889","id":"npvinHnivqn/VietnameseDictionary","author":"npvinHnivqn","disabled":false,"gated":false,"lastModified":"2023-07-08T09:13:42.000Z","likes":3,"trendingScore":1,"private":false,"sha":"5eb8b952d1294cb2e7a1e6990b5a0de391ff8615","description":"\nThis dataset includes ~30k Vietnamese words and definitions\n\n","downloads":33,"tags":["language:vi","size_categories:10K<n<100K","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-07-04T13:50:48.000Z","key":""},{"_id":"64a426979ffbf22d240dde28","id":"lmsys/mt_bench_human_judgments","author":"lmsys","disabled":false,"gated":false,"lastModified":"2023-07-20T18:28:15.000Z","likes":145,"trendingScore":1,"private":false,"sha":"f7d2896d2cc5d80f8b55c2bbc722613555233c25","description":"\n\t\n\t\t\n\t\tContent\n\t\n\nThis dataset contains 3.3K expert-level pairwise human preferences for model responses generated by 6 models in response to 80 MT-bench questions.\nThe 6 models are GPT-4, GPT-3.5, Claud-v1, Vicuna-13B, Alpaca-13B, and LLaMA-13B. The annotators are mostly graduate students with expertise in the topic areas of each of the questions. 
The details of data collection can be found in our paper.\n\n\t\n\t\t\n\t\n\t\n\t\tAgreement Calculation\n\t\n\nThis Colab notebook shows how to compute the… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/mt_bench_human_judgments.","downloads":2430,"tags":["task_categories:question-answering","language:en","license:cc-by-4.0","size_categories:1K<n<10K","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2306.05685","region:us"],"createdAt":"2023-07-04T14:03:03.000Z","key":""},{"_id":"64a5067713645758bbc23181","id":"AsakusaRinne/gaokao_bench","author":"AsakusaRinne","disabled":false,"gated":false,"lastModified":"2023-07-11T02:19:45.000Z","likes":4,"trendingScore":1,"private":false,"sha":"9aaef7bf4aca96cc66ecc305ba323193fe2d2b77","citation":"","description":"","downloads":58,"tags":["size_categories:1K<n<10K","modality:tabular","modality:text","library:datasets","library:mlcroissant","region:us"],"createdAt":"2023-07-05T05:58:15.000Z","key":""},{"_id":"64a58a01404f514d1d2a75fd","id":"imageomics/Heliconius-Collection_Cambridge-Butterfly","author":"imageomics","disabled":false,"gated":false,"lastModified":"2025-10-01T20:42:20.000Z","likes":2,"trendingScore":1,"private":false,"sha":"959a0f7ade8e729ec6717bb0df3d5dd9d4918fc1","description":"\n\t\n\t\t\n\t\tDataset Card for Heliconius Collection (Cambridge Butterfly)\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\n\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nSubset of the collection records from Chris Jiggins' research group at the University of Cambridge, collection covers nearly 20 years of field studies. \nThis subset contains approximately 36,189 RGB images of 11,962 specimens (29,134 images of 10,086 specimens across all Heliconius). Many records have both images and locality data. 
\nMost images were photographed… See the full description on the dataset page: https://huggingface.co/datasets/imageomics/Heliconius-Collection_Cambridge-Butterfly.","downloads":225,"tags":["task_categories:image-classification","language:en","size_categories:10K<n<100K","format:csv","modality:image","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","doi:10.57967/hf/4642","region:us","butterfly","heliconius","dorsal","ventral","RGB","full body","separated wings","mimicry","CV","erato","melpomene","hybrids","cross types","wild","lab-bred","mimic groups"],"createdAt":"2023-07-05T15:19:29.000Z","key":""},{"_id":"64a8e6a9e04e7f92244477d1","id":"nickrosh/Evol-Instruct-Code-80k-v1","author":"nickrosh","disabled":false,"gated":false,"lastModified":"2023-07-11T02:05:26.000Z","likes":249,"trendingScore":1,"private":false,"sha":"3ae930c20d5496e2c8386872d5628c45f6957db4","description":"Open Source Implementation of Evol-Instruct-Code as described in the WizardCoder Paper.\nCode for the intruction generation can be found on Github as Evol-Teacher.\n","downloads":5324,"tags":["license:cc-by-nc-sa-4.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2306.08568","region:us"],"createdAt":"2023-07-08T04:31:37.000Z","key":""},{"_id":"64ae54c2afb6aa55343bc420","id":"pufanyi/MIMICIT","author":"pufanyi","disabled":false,"gated":"auto","lastModified":"2024-03-28T03:35:16.000Z","likes":49,"trendingScore":1,"private":false,"sha":"7bed19ffea23748eff599fac87695b0e54c3094b","description":"\n\n\n\n\n\n\n    Bo Li*,♠,1 \n    Yuanhan Zhang*,♠,1 \n    Liangyu Chen*,1 \n    Jinghao Wang*,1 \n    Fanyi Pu*,1 \n    \n    Jingkang Yang1 \n    Chunyuan Li2 \n    Ziwei Liu✉,1\n\n\n\n    1S-Lab, Nanyang Technological University \n    2Microsoft Research, Redmond\n    \n    ♠Co-Project Lead \n    * Equal Contribution \n    ✉ Corresponding Author\n    
\n\n\nNote 1: To reduce memory consumption during image loading and improve loading speed, we are converting the JSON format of images to the Parquet format. For… See the full description on the dataset page: https://huggingface.co/datasets/pufanyi/MIMICIT.","downloads":636,"tags":["language:en","language:zh","language:es","language:ja","language:fr","language:ko","language:ar","license:mit","size_categories:1M<n<10M","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2306.05425","region:us"],"createdAt":"2023-07-12T07:22:42.000Z","key":""},{"_id":"64b2be78727711e4eaba242d","id":"npvinHnivqn/VietEngDictionary","author":"npvinHnivqn","disabled":false,"gated":false,"lastModified":"2023-07-15T15:50:33.000Z","likes":1,"trendingScore":1,"private":false,"sha":"d9d0741ff03c1b4912c4be7cd917ee0c1e253f58","downloads":42,"tags":["task_categories:translation","language:vi","language:en","license:afl-3.0","size_categories:10K<n<100K","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-07-15T15:42:48.000Z","key":""},{"_id":"64b6de8705fc1a1982b6fa26","id":"acul3/blinkdl-rwkv-indonesia","author":"acul3","disabled":false,"gated":false,"lastModified":"2023-07-18T22:23:24.000Z","likes":5,"trendingScore":1,"private":false,"sha":"80825bd608e72941001410c0b291c92b266722ef","downloads":131,"tags":["size_categories:100M<n<1B","format:json","modality:text","library:datasets","library:dask","library:mlcroissant","region:us"],"createdAt":"2023-07-18T18:48:39.000Z","key":""},{"_id":"64b7648b324038715942f15c","id":"FunDialogues/customer-service-grocery-cashier","author":"FunDialogues","disabled":false,"gated":false,"lastModified":"2023-12-18T05:23:35.000Z","likes":4,"trendingScore":1,"private":false,"sha":"a0f38aa93822521bbcc23e4e19694112d6409afe","description":"\n\t\n\t\t\n\t\tThis Dialogue\n\t\n\nComprised of fictitious examples of 
dialogues between a customer at a grocery store and the cashier. Check out the example below:\n\"id\": 1,\n\"description\": \"Price inquiry\",\n\"dialogue\": \"Customer: Excuse me, could you tell me the price of the apples per pound? Cashier: Certainly! The price for the apples is $1.99 per pound.\"\n\n\n\t\n\t\t\n\t\tHow to Load Dialogues\n\t\n\nLoading dialogues can be accomplished using the fun dialogues library or Hugging Face datasets library.… See the full description on the dataset page: https://huggingface.co/datasets/FunDialogues/customer-service-grocery-cashier.","downloads":37,"tags":["task_categories:question-answering","language:en","license:apache-2.0","size_categories:n<1K","format:csv","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","fictitious dialogues","prototyping","customer service"],"createdAt":"2023-07-19T04:20:27.000Z","key":""},{"_id":"64ba755dd4778d87da0680ed","id":"thesven/bengali-ai-train-set-tiny","author":"thesven","disabled":false,"gated":false,"lastModified":"2023-07-21T15:20:22.000Z","likes":1,"trendingScore":1,"private":false,"sha":"bedea291af5b81f48f6ad6a9acab7e3dceb5a328","description":"\n\t\n\t\t\n\t\tDataset Card for \"bengali-ai-train-set-tiny\"\n\t\n\n\n\t\n\t\t\n\t\tWhisper Model Information\n\t\n\n\nModel Homepage: openai/whisper-tiny on Hugging Face\nModel Paper: Robust Speech Recognition via Large-Scale Weak Supervision\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThis dataset is designed to help finetune the openai/whisper-tiny model with additional information in the Bengali language. 
It consists of an additional 11,000 labeled audio samples from the OOD-Speech dataset, specifically designed for… See the full description on the dataset page: https://huggingface.co/datasets/thesven/bengali-ai-train-set-tiny.","downloads":142,"tags":["size_categories:10K<n<100K","format:parquet","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2305.09688","arxiv:2212.04356","region:us"],"createdAt":"2023-07-21T12:09:01.000Z","key":""},{"_id":"64bcf69f4d2052b1aa10fea6","id":"0xDing/wikipedia-cn-20230720-filtered","author":"0xDing","disabled":false,"gated":false,"lastModified":"2023-07-23T10:06:15.000Z","likes":171,"trendingScore":1,"private":false,"sha":"4cef256a3f426ae1d3f6930c8cd59a32d785d99d","description":"本数据集基于中文维基2023年7月20日的dump存档。作为一项以数据为中心的工作，本数据集仅保留了 254,547条 质量较高的词条内容。具体而言：\n\n过滤了Template, Category, Wikipedia, File, Topic, Portal, MediaWiki, Draft, Help等特殊类型的词条\n使用启发式的方法和自有的NLU模型过滤了一部分质量较低的词条\n过滤了一部分内容较为敏感或存在争议性的词条。\n进行了简繁转换和习惯用词转换，确保符合中国大陆地区的习惯用词。\n\nThis dataset is based on the Chinese Wikipedia dump archive from July 20th, 2023. As a data-centric effort, the dataset retains 254,574 high-quality entries. 
Specifically:\n\nEntries of special types such as Template, Category, Wikipedia, File, Topic… See the full description on the dataset page: https://huggingface.co/datasets/0xDing/wikipedia-cn-20230720-filtered.","downloads":2292,"tags":["task_categories:text-generation","language:zh","license:cc-by-sa-3.0","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","wikipedia"],"createdAt":"2023-07-23T09:45:03.000Z","key":""},{"_id":"64bd648b891751ce9b1b3ed1","id":"iamtarun/code_instructions_120k_alpaca","author":"iamtarun","disabled":false,"gated":false,"lastModified":"2023-07-27T15:49:10.000Z","likes":65,"trendingScore":1,"private":false,"sha":"31f725b2d714c1b4f038e80fbaa6b977870a50b7","description":"\n\t\n\t\t\n\t\tDataset Card for code_instructions_120k_alpaca\n\t\n\nThis dataset is taken from sahil2801/code_instructions_120k, which adds a prompt column in alpaca style. Refer to the original source here. \n","downloads":1396,"tags":["task_categories:text-generation","task_categories:question-answering","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","code"],"createdAt":"2023-07-23T17:34:03.000Z","key":""},{"_id":"64be50954b4ff0d509698f72","id":"iamtarun/python_code_instructions_18k_alpaca","author":"iamtarun","disabled":false,"gated":false,"lastModified":"2023-07-27T15:51:36.000Z","likes":346,"trendingScore":1,"private":false,"sha":"7cae181e29701a8663a07a3ea43c8e105b663ba1","description":"\n\t\n\t\t\n\t\tDataset Card for python_code_instructions_18k_alpaca\n\t\n\nThe dataset contains problem descriptions and code in python language.\nThis dataset is taken from sahil2801/code_instructions_120k, which adds a prompt column in alpaca style. 
Refer to the source here.\n","downloads":17917,"tags":["task_categories:question-answering","task_categories:text-generation","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","code"],"createdAt":"2023-07-24T10:21:09.000Z","key":""},{"_id":"64c0e4f2a73ac49b2c198596","id":"dimanchkek/Deepfacelive-DFM-Models","author":"dimanchkek","disabled":false,"gated":false,"lastModified":"2025-01-10T21:35:51.000Z","likes":33,"trendingScore":1,"private":false,"sha":"ea50d8335b08ae2c13fde1ae3cf41084a68f257a","description":"\n\t\n\t\t\n\t\tDescription\n\t\n\n\n\nHere you can find files for DeepFaceLab(It's back!) and DeepFaceLive. All sources and active community members are listed below.\n\n\t\n\t\t\n\t\tDisclaimer\n\t\n\n\nThe author of this repository makes no claim to the data uploaded here other than that created by himself. Feel free to open a discussion for me to mention your contacts if I haven't done so. 
\n\n\t\n\t\t\n\t\tQuick usage guide\n\t\n\nTo use the models presented in the repository, you will need installed DeepFaceLive.\n\n.dfm models… See the full description on the dataset page: https://huggingface.co/datasets/dimanchkek/Deepfacelive-DFM-Models.","downloads":2537,"tags":["license:gpl-3.0","region:us"],"createdAt":"2023-07-26T09:18:42.000Z","key":""},{"_id":"64c2bad6b374f07cb4cace6c","id":"HydraLM/physics_dataset_alpaca","author":"HydraLM","disabled":false,"gated":false,"lastModified":"2023-07-27T18:43:43.000Z","likes":9,"trendingScore":1,"private":false,"sha":"250fb0c38cedb0b6e1996fc48d1de9e4ed5cf02f","description":"\n\t\n\t\t\n\t\tDataset Card for \"physics_dataset_alpaca\"\n\t\n\nMore Information needed\n","downloads":43,"tags":["size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-07-27T18:43:34.000Z","key":""},{"_id":"64c523313f3387bcfa19d940","id":"zai-org/LongBench","author":"zai-org","disabled":false,"gated":false,"lastModified":"2024-12-18T08:44:33.000Z","likes":184,"trendingScore":1,"private":false,"sha":"5e628be450b7e67fb7ae6e201bd6d8f7056f7672","description":"LongBench is a comprehensive benchmark for multilingual and multi-task purposes, with the goal to fully measure and evaluate the ability of pre-trained language models to understand long text. 
This dataset consists of twenty different tasks, covering key long-text application scenarios such as multi-document QA, single-document QA, summarization, few-shot learning, synthetic tasks, and code completion.","downloads":59582,"tags":["task_categories:question-answering","task_categories:text-generation","task_categories:summarization","task_categories:text-classification","language:en","language:zh","size_categories:1K<n<10K","arxiv:2308.14508","arxiv:2108.00573","arxiv:1712.07040","arxiv:2105.03011","arxiv:2104.02112","arxiv:2104.05938","arxiv:2305.05280","arxiv:2303.09752","arxiv:1910.10683","arxiv:2306.14893","arxiv:2306.03091","region:us","Long Context"],"createdAt":"2023-07-29T14:33:21.000Z","key":""},{"_id":"64c616449cff58f22c3ad415","id":"rojagtap/bookcorpus","author":"rojagtap","disabled":false,"gated":false,"lastModified":"2023-07-30T09:45:58.000Z","likes":22,"trendingScore":1,"private":false,"sha":"7600e0b87dbf25054fe4edb05035cc98fa9bf128","downloads":1195,"tags":["license:mit","size_categories:10M<n<100M","format:text","modality:text","library:datasets","library:mlcroissant","region:us"],"createdAt":"2023-07-30T07:50:28.000Z","key":""},{"_id":"64cb32838769a4019f92b223","id":"togethercomputer/llama-instruct","author":"togethercomputer","disabled":false,"gated":false,"lastModified":"2023-08-18T05:04:06.000Z","likes":28,"trendingScore":1,"private":false,"sha":"f8966a32f98b73fb797bb1ff8af8687920043d90","description":"\n\t\n\t\t\n\t\tllama-instruct\n\t\n\nThis dataset was used to finetune Llama-2-7B-32K-Instruct.\nWe follow the distillation paradigm that is used by Alpaca, Vicuna, WizardLM, Orca\n— producing instructions by querying a powerful LLM, which in our case, is the Llama-2-70B-Chat model released by Meta. 
\nTo build Llama-2-7B-32K-Instruct, we collect instructions from 19K human inputs extracted from ShareGPT-90K (only using human inputs, not ChatGPT outputs).\nThe actual script handles multi-turn conversations… See the full description on the dataset page: https://huggingface.co/datasets/togethercomputer/llama-instruct.","downloads":61,"tags":["language:en","license:llama2","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2304.12244","region:us"],"createdAt":"2023-08-03T04:52:19.000Z","key":""},{"_id":"64cb644000377e284874fc7c","id":"Supabase/wikipedia-en-embeddings","author":"Supabase","disabled":false,"gated":false,"lastModified":"2023-08-03T12:12:39.000Z","likes":16,"trendingScore":1,"private":false,"sha":"a46c85b41829b8c94ec939e1402c3f45133cd717","description":"\n\t\n\t\t\n\t\tOpenAI, all-MiniLM-L6-v2, GTE-small embeddings for Wikipedia Simple English\n\t\n\nTexts and OpenAI embeddings are genereted by Stephan Sturges, big thanks for sharing this dataset.\nHere we added embeddings for all-MiniLM-L6-v2 and GTE-small.\nTotal 224,482 vectors for each model.\n\n\t\n\t\t\n\t\n\t\n\t\tNotes\n\t\n\nThese are the embeddings and corresponded simplified articles from the wikipedia \"simple english\" dump.\nPlease see wikipedia's licensing for usage information:… See the full description on the dataset page: https://huggingface.co/datasets/Supabase/wikipedia-en-embeddings.","downloads":88,"tags":["language:en","license:mit","size_categories:100K<n<1M","region:us"],"createdAt":"2023-08-03T08:24:32.000Z","key":""},{"_id":"64cc00864726a3f833653161","id":"garage-bAInd/Open-Platypus","author":"garage-bAInd","disabled":false,"gated":false,"lastModified":"2024-01-24T19:09:41.000Z","likes":422,"trendingScore":1,"private":false,"sha":"37141edbdb7826378cce118c46a109b813e1f038","description":"\n\t\n\t\t\n\t\tOpen-Platypus\n\t\n\nThis dataset is focused on improving LLM logical reasoning 
skills and was used to train the Platypus2 models. It is comprised of the following datasets, which were filtered using keyword search and then Sentence Transformers to remove questions with a similarity above 80%:\n\n\t\n\t\t\nDataset Name\nLicense Type\n\n\n\t\t\nPRM800K\nMIT\n\n\nMATH\nMIT\n\n\nScienceQA\nCreative Commons Attribution-NonCommercial-ShareAlike 4.0 International\n\n\nSciBench\nMIT\n\n\nReClor\nNon-commercial\n\n\nTheoremQA\nMIT… See the full description on the dataset page: https://huggingface.co/datasets/garage-bAInd/Open-Platypus.","downloads":8554,"tags":["language:en","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2308.07317","arxiv:2305.20050","arxiv:2305.12524","region:us"],"createdAt":"2023-08-03T19:31:18.000Z","key":""},{"_id":"64ccda8e710645aa7b2e7a20","id":"TitanMLData/arxiv_qa","author":"TitanMLData","disabled":false,"gated":false,"lastModified":"2023-08-04T11:38:53.000Z","likes":4,"trendingScore":1,"private":false,"sha":"cbd9f3684eeeadc96772ba654aec5e158ce2fa32","description":"\n\t\n\t\t\n\t\tArxiv Paper Generative Question Answering\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThis dataset is made using ChatGPT (text-davinci-003) to generate Question/Answer pairs from Arxiv papers from this dataset\n\n\t\n\t\t\n\t\tData Fields\n\t\n\n\nTextID: references the datarow (paper) in the arxiv summarizer dataset\nQuestion: question based on the text\nResponse: answer\nText: Full text with the paper as 'context:' and and the question appended as 'question:'. 
Used for generative question answering using language… See the full description on the dataset page: https://huggingface.co/datasets/TitanMLData/arxiv_qa.","downloads":38,"tags":["task_categories:question-answering","language:en","size_categories:10K<n<100K","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-08-04T11:01:34.000Z","key":""},{"_id":"64d1091fbc6c9c8bc0617bb8","id":"Universal-NER/Pile-NER-definition","author":"Universal-NER","disabled":false,"gated":false,"lastModified":"2023-08-07T17:08:06.000Z","likes":20,"trendingScore":1,"private":false,"sha":"c2772f9143d6c6b3de9bf068522fb59dcd9b3b35","description":"\n\t\n\t\t\n\t\tIntro\n\t\n\nPile-NER-definition is a set of GPT-generated data for named entity recognition using the definition-based data construction prompt. It was collected by prompting gpt-3.5-turbo-0301 and augmented by negative sampling. Check our project page for more information.\n\n\t\n\t\t\n\t\tLicense\n\t\n\nAttribution-NonCommercial 4.0 International\n","downloads":104,"tags":["language:en","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-08-07T15:09:19.000Z","key":""},{"_id":"64d5e3e26682b2e845f6db9c","id":"docling-project/USPTO-30K","author":"docling-project","disabled":false,"gated":false,"lastModified":"2023-08-24T08:28:32.000Z","likes":10,"trendingScore":1,"private":false,"sha":"58b3a773f01816d62126f11dec4d5e033e8ebf9d","description":"\n\t\n\t\t\n\t\tUSPTO-30K\n\t\n\nUSPTO-30K is the benchmark dataset introduced in MolGrapher: Graph-based Visual Recognition of Chemical Structures.\nExisting benchmarks for Optical Chemical Structure Recognition have some limitations. \nBeing created using only a few documents, they contain batches of very similar molecules. 
For example, in a patent, a molecule could typically be displayed together with all the substituents of one particular substructure, resulting in large batches of almost identical… See the full description on the dataset page: https://huggingface.co/datasets/docling-project/USPTO-30K.","downloads":353,"tags":["size_categories:10K<n<100K","format:parquet","modality:image","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-08-11T07:31:46.000Z","key":""},{"_id":"64de6da13d3a7519f17c623c","id":"ProgramComputer/voxceleb","author":"ProgramComputer","disabled":false,"gated":false,"lastModified":"2026-02-17T14:30:59.000Z","likes":115,"trendingScore":1,"private":false,"sha":"b8a8825c87582831cc217b6ff84a36558ced84d6","description":"This dataset includes both VoxCeleb and VoxCeleb2\nThe copyright remains with the original owners of the audiovisual\n\n\t\n\t\t\n\t\tMultipart Zips\n\t\n\nAlready joined zips for convenience but these specified files are NOT part of the original datasets\nvox2_mp4_1.zip - vox2_mp4_6.zip \nvox2_aac_1.zip - vox2_aac_2.zip \n\n\t\n\t\t\n\t\tJoining Zip\n\t\n\ncat vox1_dev* > vox1_dev_wav.zip\n\ncat vox2_dev_aac* > vox2_aac.zip\n\ncat vox2_dev_mp4* > vox2_mp4.zip\n\n\n\t\n\t\t\n\t\tCitation Information\n\t\n\n@article{Nagrani19,\n    author =… See the full description on the dataset page: 
https://huggingface.co/datasets/ProgramComputer/voxceleb.","downloads":7378,"tags":["task_categories:automatic-speech-recognition","task_categories:audio-classification","task_categories:image-classification","task_categories:video-classification","license:cc-by-4.0","size_categories:100K<n<1M","arxiv:1706.08612","doi:10.57967/hf/0999","region:us"],"createdAt":"2023-08-17T18:57:37.000Z","key":""},{"_id":"64e0ab55ead049c7b8a04be5","id":"pourmand1376/persian-qa-translated","author":"pourmand1376","disabled":false,"gated":false,"lastModified":"2023-08-19T11:52:23.000Z","likes":4,"trendingScore":1,"private":false,"sha":"54b6cc08a0282f5f1361fbb3fe23a095cce9db79","description":"\n\t\n\t\t\n\t\tDataset Card for \"persian-qa-translated\"\n\t\n\nMore Information needed\n","downloads":25,"tags":["task_categories:question-answering","task_categories:translation","task_categories:text-generation","language:fa","language:en","license:apache-2.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-08-19T11:45:25.000Z","key":""},{"_id":"64e6c8abc3b2443fb3ef3f06","id":"LetsChurch/bible-embeddings","author":"LetsChurch","disabled":false,"gated":false,"lastModified":"2026-03-11T13:44:27.000Z","likes":5,"trendingScore":1,"private":false,"sha":"5d22205e06abedbcf7c33e217b29d6f01e27a7b0","description":"\n\t\n\t\t\n\t\tBible Embeddings\n\t\n\nA comprehensive tool for generating and evaluating Bible verse embeddings using various state-of-the-art embedding models. This project supports both commercial APIs (OpenAI, Google Gemini, Voyage AI) and open-source models (HuggingFace sentence-transformers) for semantic search across biblical texts.\n\n\t\n\t\t\n\t\tSetup\n\t\n\nThis project is managed with uv. 
Make sure you have uv installed, then set up the project:\n# Install dependencies\nuv sync\n\n# Install specific provider… See the full description on the dataset page: https://huggingface.co/datasets/LetsChurch/bible-embeddings.","downloads":431,"tags":["region:us"],"createdAt":"2023-08-24T03:04:11.000Z","key":""},{"_id":"64e77c453e1237c874d34ffe","id":"bitext/Bitext-customer-support-llm-chatbot-training-dataset","author":"bitext","disabled":false,"gated":false,"lastModified":"2024-07-18T18:19:33.000Z","likes":176,"trendingScore":1,"private":false,"sha":"430d1a89bd93bd1fa23c16f29dd53e73f0087443","description":"\n\t\n\t\t\n\t\tBitext - Customer Service Tagged Training Dataset for LLM-based Virtual Assistants\n\t\n\n\n\t\n\t\t\n\t\tOverview\n\t\n\nThis hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the Customer Support sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: 
https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset.","downloads":5240,"tags":["task_categories:question-answering","task_categories:table-question-answering","language:en","license:cdla-sharing-1.0","size_categories:10K<n<100K","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","question-answering","llm","chatbot","customer-support","conversional-ai","generative-ai","natural-language-understanding","fine-tuning","Retail"],"createdAt":"2023-08-24T15:50:29.000Z","key":""},{"_id":"64e85e5fa7c3693a8be34126","id":"alayaran/bodo-monolingual-dataset","author":"alayaran","disabled":false,"gated":false,"lastModified":"2023-09-16T16:13:42.000Z","likes":1,"trendingScore":1,"private":false,"sha":"985881584c7ff5e6c7ef05d9282188eb38cc8c35","description":"# First Install datasets library\n\npip install datasets\n\n\nfrom datasets import load_dataset\n\ntrain = load_dataset(\"alayaran/bodo-monolingual-dataset\", \"unshuffled_deduplicated_no\", split=\"train\")\ntest = load_dataset(\"alayaran/bodo-monolingual-dataset\", \"unshuffled_deduplicated_no\", split=\"test\")\n\n# print the first five entries from the dataset array of trai and test set\nprint(train['text'][:5])\n[\"मदि सरकारा जारिमिनारि हाबाफारि मावफूंदों ,  1  कौटि नख'राव दैनि कानेक्सन होबाय\",\n\"दिल्ली / जयपुर… See the full description on the dataset page: 
https://huggingface.co/datasets/alayaran/bodo-monolingual-dataset.","downloads":53,"tags":["task_categories:text-generation","language:brx","license:mit","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-08-25T07:55:11.000Z","key":""},{"_id":"64eddebbd0a1f602423a9df0","id":"OpenAssistant/OASST-DE","author":"OpenAssistant","disabled":false,"gated":false,"lastModified":"2023-11-13T10:24:29.000Z","likes":14,"trendingScore":1,"private":false,"sha":"e72f919251afc134b72602612a2a7a04eb5f2bd5","description":"\n\t\n\t\t\n\t\tGerman OpenAssistant Conversations Dataset (OASST-DE)\n\t\n\nWith the goal of advancing open-source, german-language LLM research, we present \nOASST-DE: a high quality subset of a recent (25.08.23) dump from the OpenAssistant website\ntranslated to German using the GPT-3.5 API. More details on how the dataset was filtered and translated under dataset creation.\nFor more details on the OpenAssistant Project, look at the first OASST dataset (OASST1), the Open-Assistant GitHub repo\nor our… See the full description on the dataset page: https://huggingface.co/datasets/OpenAssistant/OASST-DE.","downloads":155,"tags":["language:de","license:apache-2.0","size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2304.07327","region:us"],"createdAt":"2023-08-29T12:04:11.000Z","key":""},{"_id":"64f22ea1beece91ec8f26e5d","id":"SuryaKrishna02/aya-telugu-poems","author":"SuryaKrishna02","disabled":false,"gated":false,"lastModified":"2024-01-24T06:13:06.000Z","likes":10,"trendingScore":1,"private":false,"sha":"5f3eff71f0ca03180d3a38b3d99397b8d1dc51d0","description":"\n\t\n\t\t\n\t\tSummary\n\t\n\naya-telugu-poems is an open source dataset of instruct-style records generated by webscraping a Telugu poems website. 
This was created as part of Aya Open Science Initiative from Cohere For AI.\nThis dataset can be used for any purpose, whether academic or commercial, under the terms of the Apache 2.0 License.\nSupported Tasks:\n\nTraining LLMs\nSynthetic Data Generation\nData Augmentation\n\nLanguages: Telugu Version: 1.0\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Overview\n\t\n\naya-telugu-poems is a corpus… See the full description on the dataset page: https://huggingface.co/datasets/SuryaKrishna02/aya-telugu-poems.","downloads":165,"tags":["task_categories:text-generation","task_ids:language-modeling","annotations_creators:expert-generated","language_creators:expert-generated","multilinguality:monolingual","source_datasets:original","language:te","license:apache-2.0","size_categories:1K<n<10K","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","literature","poems"],"createdAt":"2023-09-01T18:34:09.000Z","key":""},{"_id":"64f562e368dde13639974a42","id":"harouzie/vi_question_generation","author":"harouzie","disabled":false,"gated":false,"lastModified":"2023-09-04T05:02:36.000Z","likes":2,"trendingScore":1,"private":false,"sha":"515c2cf2e64d034d4796f98b78c7c16d1ce81b78","downloads":36,"tags":["task_categories:question-answering","language:vi","license:mit","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-09-04T04:53:55.000Z","key":""},{"_id":"64f72b773e89b329daa77f8e","id":"AndressaStefany/bug-reports","author":"AndressaStefany","disabled":false,"gated":false,"lastModified":"2023-09-10T01:55:31.000Z","likes":2,"trendingScore":1,"private":false,"sha":"7e91ceba647f2b4b3de6ce4e455ff286baa74a19","downloads":61,"tags":["size_categories:100K<n<1M","format:csv","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-09-05T13:21:59.000Z","key":""},{"_id":"
64fc177dd268b2f1adb97ec9","id":"lavita/ChatDoctor-HealthCareMagic-100k","author":"lavita","disabled":false,"gated":false,"lastModified":"2023-09-09T07:40:38.000Z","likes":112,"trendingScore":1,"private":false,"sha":"505443eac4e99ccedeffbb6f640061223d1d4bb3","description":"\n\t\n\t\t\n\t\tDataset Card for \"ChatDoctor-HealthCareMagic-100k\"\n\t\n\nMore Information needed\n","downloads":4767,"tags":["size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-09-09T06:58:05.000Z","key":""},{"_id":"65036d0869558920757dd01e","id":"EleutherAI/hendrycks_math","author":"EleutherAI","disabled":false,"gated":false,"lastModified":"2025-01-12T19:39:12.000Z","likes":106,"trendingScore":1,"private":false,"sha":"21a5633873b6a120296cce3e2df9d5550074f4a3","description":"\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nMATH dataset from https://github.com/hendrycks/math\n\n\t\n\t\t\n\t\tCitation Information\n\t\n\n@article{hendrycksmath2021,\n  title={Measuring Mathematical Problem Solving With the MATH Dataset},\n  author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},\n  journal={NeurIPS},\n  year={2021}\n}\n\n","downloads":102122,"tags":["license:mit","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-09-14T20:28:56.000Z","key":""},{"_id":"650c7be62a602ba34904a362","id":"meta-math/MetaMathQA","author":"meta-math","disabled":false,"gated":false,"lastModified":"2023-12-21T01:35:53.000Z","likes":463,"trendingScore":1,"private":false,"sha":"aa4f34d3d2d3231299b5b03d9b3e5a20da45aa18","description":"View the project page:\nhttps://meta-math.github.io/\nsee our paper at https://arxiv.org/abs/2309.12284\n\n\t\n\t\t\n\t\tNote\n\t\n\nAll MetaMathQA data are augmented from the training sets of 
GSM8K and MATH. \nNone of the augmented data is from the testing set.\nYou can check the original_question in meta-math/MetaMathQA, each item is from the GSM8K or MATH train set.\n\n\t\n\t\t\n\t\tModel Details\n\t\n\nMetaMath-Mistral-7B is fully fine-tuned on the MetaMathQA datasets and based on the powerful Mistral-7B model. It is… See the full description on the dataset page: https://huggingface.co/datasets/meta-math/MetaMathQA.","downloads":45519,"tags":["license:mit","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2309.12284","region:us","math","math-qa"],"createdAt":"2023-09-21T17:22:46.000Z","key":""},{"_id":"650d05ebdcf6501f3aee1fff","id":"Duxiaoman-DI/FinanceIQ","author":"Duxiaoman-DI","disabled":false,"gated":false,"lastModified":"2023-09-22T14:50:42.000Z","likes":49,"trendingScore":1,"private":false,"sha":"fca1d9ceb6925041264d211b9ddf7264e8145248","downloads":203,"tags":["license:cc-by-nc-sa-4.0","size_categories:1K<n<10K","format:csv","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-09-22T03:11:39.000Z","key":""},{"_id":"65132d53330c55fdc533fe75","id":"mindchain/wikitext2","author":"mindchain","disabled":false,"gated":false,"lastModified":"2023-09-26T19:13:55.000Z","likes":11,"trendingScore":1,"private":false,"sha":"78f87cee3758b0789c10049897260ed34b9b9bb5","description":"\n\t\n\t\t\n\t\tDataset Card for \"wikitext\"\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\n The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified\n Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.\nCompared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over\n110 times larger. 
The WikiText dataset also features a far larger… See the full description on the dataset page: https://huggingface.co/datasets/mindchain/wikitext2.","downloads":343,"paperswithcode_id":"wikitext-2","tags":["task_categories:text-generation","task_categories:fill-mask","task_ids:language-modeling","task_ids:masked-language-modeling","annotations_creators:no-annotation","language_creators:crowdsourced","multilinguality:monolingual","source_datasets:original","language:en","license:cc-by-sa-3.0","license:gfdl","size_categories:1M<n<10M","arxiv:1609.07843","region:us"],"createdAt":"2023-09-26T19:13:23.000Z","key":""},{"_id":"651389c2a6755fc9773f321a","id":"shengqin/web-attacks-long","author":"shengqin","disabled":false,"gated":false,"lastModified":"2023-10-03T07:50:07.000Z","likes":3,"trendingScore":1,"private":false,"sha":"6b4f4dc85bdedc0d5356c25aaddafa613aed88ed","downloads":30,"tags":["size_categories:10K<n<100K","format:csv","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-09-27T01:47:46.000Z","key":""},{"_id":"651fe4f46f1df5466364ea8c","id":"erenfazlioglu/turkishvoicedataset","author":"erenfazlioglu","disabled":false,"gated":false,"lastModified":"2024-11-26T15:30:34.000Z","likes":44,"trendingScore":1,"private":false,"sha":"13238c462f32f6c2fd8293f732f2eac2a03ce48c","description":"\n\t\n\t\t\n\t\tDataset Card for \"turkishneuralvoice\"\n\t\n\n\n\t\n\t\t\n\t\tDataset Overview\n\t\n\nDataset Name: Turkish Neural Voice\nDescription: This dataset contains Turkish audio samples generated using Microsoft Text to Speech services. 
The dataset includes audio files and their corresponding transcriptions.\n\n\t\n\t\t\n\t\tDataset Structure\n\t\n\nConfigs:\n\ndefault\n\nData Files:\n\nSplit: train\nPath: data/train-*\n\n\n\nDataset Info:\n\nFeatures:\naudio: Audio file\ntranscription: Corresponding text transcription\n\n\nSplits:\ntrain… See the full description on the dataset page: https://huggingface.co/datasets/erenfazlioglu/turkishvoicedataset.","downloads":196,"tags":["task_categories:text-to-speech","language:tr","license:cc-by-nc-4.0","size_categories:100K<n<1M","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","audio","text-to-speech","turkish","synthetic-voice"],"createdAt":"2023-10-06T10:44:04.000Z","key":""},{"_id":"652083cf728c0b6dc7fa1805","id":"berkeley-nest/Nectar","author":"berkeley-nest","disabled":false,"gated":false,"lastModified":"2024-03-20T04:17:46.000Z","likes":295,"trendingScore":1,"private":false,"sha":"3c6b4c47fa1cc38869f9f32dce1699f7abad8b06","description":"\n\t\n\t\t\n\t\tDataset Card for Nectar\n\t\n\n\nDeveloped by: Banghua Zhu * , Evan Frick * , Tianhao Wu * , Hanlin Zhu and Jiantao Jiao.\nLicense: Apache-2.0 license under the condition that the dataset is not used to compete with OpenAI\n\nNectar is the first high-quality 7-wise comparison dataset, generated through GPT-4-based ranking. Nectar contains diverse chat prompts, high-quality and diverse responses, and accurate ranking labels. 
Nectar's prompts are an amalgamation of diverse sources, including… See the full description on the dataset page: https://huggingface.co/datasets/berkeley-nest/Nectar.","downloads":1433,"tags":["language:en","license:apache-2.0","size_categories:100K<n<1M","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","RLHF","RLAIF","reward model"],"createdAt":"2023-10-06T22:01:51.000Z","key":""},{"_id":"65267b22cdeab5e88788176a","id":"keivalya/MedQuad-MedicalQnADataset","author":"keivalya","disabled":false,"gated":false,"lastModified":"2023-10-11T10:50:41.000Z","likes":131,"trendingScore":1,"private":false,"sha":"5b0961fbaa6d7f9c344c5d59c29943fb900c2eca","description":"\n\t\n\t\t\n\t\tReference:\n\t\n\n\n\"A Question-Entailment Approach to Question Answering\". Asma Ben Abacha and Dina Demner-Fushman. BMC Bioinformatics, 2019.\n\n","downloads":3052,"tags":["task_categories:question-answering","size_categories:10K<n<100K","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-10-11T10:38:26.000Z","key":""},{"_id":"65286a3df0042c8301ca7b3e","id":"librarian-bots/model_cards_with_metadata","author":"librarian-bots","disabled":false,"gated":false,"lastModified":"2026-06-30T03:18:42.000Z","likes":24,"trendingScore":1,"private":false,"sha":"22a01a25977195c8f50139c5a08df69722e0826e","description":"\n\t\n\t\t\n\t\n\t\n\t\tDataset Card for Hugging Face Hub Model Cards\n\t\n\nThis datasets consists of model cards for models hosted on the Hugging Face Hub. The model cards are created by the community and provide information about the model, its performance, its intended uses, and more. 
\nThis dataset is updated on a daily basis and includes publicly available models on the Hugging Face Hub.\nThis dataset is made available to help support users wanting to work with a large number of Model Cards from the Hub.… See the full description on the dataset page: https://huggingface.co/datasets/librarian-bots/model_cards_with_metadata.","downloads":762,"tags":["task_categories:text-retrieval","size_categories:100K<n<1M","format:parquet","format:optimized-parquet","modality:tabular","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us","ethics"],"createdAt":"2023-10-12T21:50:53.000Z","key":""},{"_id":"6529121c94bb032403e25d49","id":"distil-whisper/earnings22","author":"distil-whisper","disabled":false,"gated":false,"lastModified":"2023-10-13T12:00:56.000Z","likes":18,"trendingScore":1,"private":false,"sha":"0a034f9ed86d33a3859d9025d3e621cf243773ab","description":"\n\t\n\t\t\n\t\tDataset Card for Earnings 22\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nEarnings-22 provides a free-to-use benchmark of real-world, accented audio to bridge academic and industrial research.\nThis dataset contains 125 files totalling roughly 119 hours of English language earnings calls from global countries. 
\nThis dataset provides the full audios, transcripts, and accompanying metadata such as ticker symbol, headquarters country, \nand our defined \"Language Region\".\n\n\t\n\t\t\n\t\n\t\n\t\tSupported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/distil-whisper/earnings22.","downloads":2611,"tags":["size_categories:10K<n<100K","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2203.15591","region:us"],"createdAt":"2023-10-13T09:47:08.000Z","key":""},{"_id":"652a8fe13a416e1f21996416","id":"hezarai/persian-license-plate-v1","author":"hezarai","disabled":false,"gated":false,"lastModified":"2025-03-04T11:01:20.000Z","likes":10,"trendingScore":1,"private":false,"sha":"3ca8609f245f0aab2c4f1d93986ecebc6b923081","description":"\nDataset is downloaded from here which was provided at Amirkabir University of Technology.\nThe dataset is labeled by the authors.\nExperimental results show that the fine-tuned model works well in Persian License Plate.\n\n\n\t\n\t\t\n\t\tUsage\n\t\n\nYou can download the dataset easily using HF datasets package in Python:\n!pip install datasets\n\nfrom datasets import load_dataset\ndataset = load_dataset(\"hezarai/persian-license-plate-v1\", split=\"train\")  # Other splits: validation, test\nprint(dataset[0])\n\n","downloads":363,"tags":["task_categories:image-to-text","language:fa","size_categories:1K<n<10K","format:parquet","modality:image","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-10-14T12:56:01.000Z","key":""},{"_id":"652c26161a3250bbfe6b96d0","id":"AI4Math/MathVista","author":"AI4Math","disabled":false,"gated":false,"lastModified":"2024-02-11T23:09:05.000Z","likes":220,"trendingScore":1,"private":false,"sha":"2b6ad69445fbb5695c9b165475e8decdbeb97747","description":"\n\t\n\t\t\n\t\tDataset Card for MathVista\n\t\n\n\nDataset Description\nPaper 
Information\nDataset Examples\nLeaderboard\nDataset Usage\nData Downloading\nData Format\nData Visualization\nData Source\nAutomatic Evaluation\n\n\nLicense\nCitation\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Description\n\t\n\nMathVista is a consolidated Mathematical reasoning benchmark within Visual contexts. It consists of three newly created datasets, IQTest, FunctionQA, and PaperQA, which address the missing visual domains and are tailored to evaluate logical… See the full description on the dataset page: https://huggingface.co/datasets/AI4Math/MathVista.","downloads":13915,"paperswithcode_id":"mathvista","tags":["task_categories:multiple-choice","task_categories:question-answering","task_categories:visual-question-answering","task_categories:text-classification","task_ids:multiple-choice-qa","task_ids:closed-domain-qa","task_ids:open-domain-qa","task_ids:visual-question-answering","task_ids:multi-class-classification","annotations_creators:expert-generated","annotations_creators:found","language_creators:expert-generated","language_creators:found","multilinguality:monolingual","source_datasets:original","language:en","language:zh","language:fa","license:cc-by-sa-4.0","size_categories:1K<n<10K","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2310.02255","region:us","multi-modal-qa","math-qa","figure-qa","geometry-qa","math-word-problem","textbook-qa","vqa","arithmetic-reasoning","statistical-reasoning","algebraic-reasoning","geometry-reasoning","numeric-common-sense","scientific-reasoning","logical-reasoning","geometry-diagram","synthetic-scene","chart","plot","scientific-figure","table","function-plot","abstract-scene","puzzle-test","document-image","medical-image","mathematics","science","chemistry","biology","physics","engineering","natural-science"],"createdAt":"2023-10-15T17:49:10.000Z","key":""},{"_id":"653185d9d0787a1405888f1f","id":"lavita/MedQuAD","author":"lavita","disabled":fa
lse,"gated":false,"lastModified":"2023-12-22T22:28:40.000Z","likes":22,"trendingScore":1,"private":false,"sha":"84ea67f83cec9692ad254eaa02c9731b24ecfe4c","description":"\n\t\n\t\t\n\t\tDataset Card for \"MedQuAD\"\n\t\n\nThis dataset is the converted version of MedQuAD. Some notes about the data:\n\nMultiple values in the umls_cui, umls_semantic_types, synonyms columns are separated by | character.\nAnswers for [GARD, MPlusHerbsSupplements, ADAM, MPlusDrugs] sources (31,034 records) are removed from the original dataset to respect the MedlinePlus copyright.\nUMLS (umls): Unified Medical Language System\nCUI (cui): Concept Unique Identifier\n\n\n\t\n\t\n\t\n\t\tQuestion type discrepancies… See the full description on the dataset page: https://huggingface.co/datasets/lavita/MedQuAD.","downloads":2705,"tags":["task_categories:question-answering","language:en","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","medical"],"createdAt":"2023-10-19T19:39:05.000Z","key":""},{"_id":"6532131b5c959f0f95d66e2f","id":"reczoo/Criteo_x1","author":"reczoo","disabled":false,"gated":false,"lastModified":"2023-12-23T06:23:14.000Z","likes":2,"trendingScore":1,"private":false,"sha":"4ecadde5eb8473e6ed6c570d9fb57e71079c0ef4","description":"\n\t\n\t\t\n\t\tCriteo_x1\n\t\n\n\nDataset description:\nThe Criteo dataset is a widely-used benchmark dataset for CTR prediction, which contains about one week of click-through data for display advertising. It has 13 numerical feature fields and 26 categorical feature fields. Following the AFN work, we randomly split the data into 7:2:1* as the training set, validation set, and test set, respectively. 
\nThe dataset statistics are summarized as follows:\n\n\t\n\t\t\nDataset Split\nTotal\n#Train\n#Validation\n#Test… See the full description on the dataset page: https://huggingface.co/datasets/reczoo/Criteo_x1.","downloads":771,"tags":["size_categories:10M<n<100M","format:csv","modality:tabular","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2304.00902","region:us"],"createdAt":"2023-10-20T05:41:47.000Z","key":""},{"_id":"6535951011c8c094d4f7e87d","id":"Oumar199/French_Wolof_Various_Parallel_Corpus","author":"Oumar199","disabled":false,"gated":false,"lastModified":"2024-09-11T12:32:17.000Z","likes":2,"trendingScore":1,"private":false,"sha":"f179c8ab35c667b640b501ed7aa6808a1f075b08","downloads":43,"tags":["task_categories:translation","language:fr","language:wo","size_categories:1K<n<10K","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-10-22T21:33:04.000Z","key":""},{"_id":"6537885875990e188b075c3d","id":"pingzhili/uspto-50k","author":"pingzhili","disabled":false,"gated":false,"lastModified":"2023-12-18T04:24:59.000Z","likes":5,"trendingScore":1,"private":false,"sha":"8296db0b791758743cb8efb5d62cc10d03c2b988","description":"\n\t\n\t\t\n\t\tDataset Card for \"uspto-50k\"\n\t\n\nMore Information needed\n","downloads":690,"tags":["size_categories:10K<n<100K","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-10-24T09:03:20.000Z","key":""},{"_id":"6537ac17aa214745cbc1c3e1","id":"lzy337/attack_data_hf","author":"lzy337","disabled":false,"gated":false,"lastModified":"2023-10-24T12:24:10.000Z","likes":1,"trendingScore":1,"private":false,"sha":"1b81962ba993a0c5a9ce484c4c250f334be231af","description":"Toxicity contains three types of data: 1. from realtoxicity prompt; 2. response from gpt3.5 generation as prompt; 3. 
same as 2 but it comes from gpt4\n","downloads":80,"tags":["size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-10-24T11:35:51.000Z","key":""},{"_id":"6537c02f46cdab563db94d0c","id":"polyhedralai/mining_concepts","author":"polyhedralai","disabled":false,"gated":false,"lastModified":"2023-10-24T13:03:03.000Z","likes":5,"trendingScore":1,"private":false,"sha":"2ae644bc526954a01ec2d467fae3a2ba2221ff58","downloads":27,"tags":["license:mit","size_categories:1K<n<10K","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-10-24T13:01:35.000Z","key":""},{"_id":"6539bda9e279d2fc80bfc994","id":"togethercomputer/RedPajama-Data-V2","author":"togethercomputer","disabled":false,"gated":false,"lastModified":"2024-11-21T09:33:17.000Z","likes":404,"trendingScore":1,"private":false,"sha":"aa033f6b6480c445557e9118283993403f069f6a","description":"RedPajama V2: an Open Dataset for Training Large Language 
Models","downloads":9619,"tags":["task_categories:text-generation","language:en","language:de","language:fr","language:es","language:it","arxiv:2302.03169","arxiv:2302.13971","arxiv:2204.02311","arxiv:2112.06905","arxiv:1910.10683","arxiv:2305.13169","arxiv:2306.01116","arxiv:2112.11446","arxiv:2411.12372","region:us"],"createdAt":"2023-10-26T01:15:21.000Z","key":""},{"_id":"653a6a06ae155b92bae6c820","id":"aekilica/Alpaca_Dolly","author":"aekilica","disabled":false,"gated":false,"lastModified":"2023-10-26T13:45:38.000Z","likes":4,"trendingScore":1,"private":false,"sha":"d3eedfce2ba4cd04b9b57afa2186985e6648e998","downloads":40,"tags":["task_categories:table-question-answering","task_categories:question-answering","language:tr","license:apache-2.0","size_categories:10K<n<100K","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-10-26T13:30:46.000Z","key":""},{"_id":"653a6f30c75dc136bfb0ad1f","id":"lmsys/toxic-chat","author":"lmsys","disabled":false,"gated":false,"lastModified":"2024-05-14T08:07:42.000Z","likes":195,"trendingScore":1,"private":false,"sha":"29df8e4dba60e1f4af4b4075c0705c5b313548a8","description":"\n\t\n\t\t\n\t\tUpdate\n\t\n\n[01/31/2024] We update the OpenAI Moderation API results for ToxicChat (0124) based on their updated moderation model on on Jan 25, 2024.[01/28/2024] We release an official T5-Large model trained on ToxicChat (toxicchat0124). 
Go and check it for your baseline comparison![01/19/2024] We have a new version of ToxicChat (toxicchat0124)!\n\n\t\n\t\t\n\t\tContent\n\t\n\nThis dataset contains toxicity annotations on 10K user prompts collected from the Vicuna online demo.\nWe utilize a human-AI… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/toxic-chat.","downloads":5366,"tags":["task_categories:text-classification","language:en","license:cc-by-nc-4.0","size_categories:10K<n<100K","format:csv","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2310.17389","region:us"],"createdAt":"2023-10-26T13:52:48.000Z","key":""},{"_id":"653bdc2e5d84c01f013bacca","id":"rag-datasets/rag-mini-wikipedia","author":"rag-datasets","disabled":false,"gated":false,"lastModified":"2024-06-02T11:14:04.000Z","likes":51,"trendingScore":1,"private":false,"sha":"1f9f3b53fbc5995b85aab8e993504ad42c5f16f6","description":"In this huggingface discussion you can share what you used the dataset for.\nDerives from https://www.kaggle.com/datasets/rtatman/questionanswer-dataset?resource=download we generated our own subset using generate.py.\n","downloads":1681,"tags":["task_categories:question-answering","task_categories:sentence-similarity","language:en","license:cc-by-3.0","size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","rag","wikipedia","open-domain","information-retrieval","dpr"],"createdAt":"2023-10-27T15:50:06.000Z","key":""},{"_id":"653bdc7580fbbdd6bf2792be","id":"rag-datasets/rag-mini-bioasq","author":"rag-datasets","disabled":false,"gated":false,"lastModified":"2024-06-17T06:55:33.000Z","likes":38,"trendingScore":1,"private":false,"sha":"224a87f64a5c3a720b5bc627cf760f543b2f1e79","description":"See here for an updated version without nans in text-corpus.\nIn this huggingface discussion you can share what you used the dataset 
for.\nDerives from http://participants-area.bioasq.org/Tasks/11b/trainingDataset/ we generated our own subset using generate.py.\n","downloads":2319,"tags":["task_categories:question-answering","task_categories:sentence-similarity","language:en","license:cc-by-2.5","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","rag","dpr","information-retrieval","question-answering","biomedical"],"createdAt":"2023-10-27T15:51:17.000Z","key":""},{"_id":"653be37343f068f20b3ce47b","id":"Malikeh1375/medical-question-answering-datasets","author":"Malikeh1375","disabled":false,"gated":false,"lastModified":"2026-04-09T17:57:59.000Z","likes":77,"trendingScore":1,"private":false,"sha":"29833779cb5921f474d9f469aa85c115277bf489","downloads":849,"tags":["task_categories:question-answering","language:en","license:mit","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","medical","clinical","healthcare"],"createdAt":"2023-10-27T16:21:07.000Z","key":""},{"_id":"653c445377c2f094527e6e9e","id":"leduckhai/VietMed","author":"leduckhai","disabled":false,"gated":false,"lastModified":"2026-05-25T16:25:07.000Z","likes":24,"trendingScore":1,"private":false,"sha":"cc7980cd1392d2d85cf1c692b1d96e8581fc7eec","description":"\n\t\n\t\t\n\t\tVietMed: A Dataset and Benchmark for Automatic Speech Recognition of Vietnamese in the Medical Domain (LREC-COLING 2024, Oral)\n\t\n\n\n\t\n\t\t\n\t\tDescription:\n\t\n\nWe introduced a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled medical speech and 1200h of unlabeled general-domain speech. 
\nTo our best knowledge, VietMed is by far the world’s largest public medical speech recognition dataset in 7 aspects:\ntotal duration… See the full description on the dataset page: https://huggingface.co/datasets/leduckhai/VietMed.","downloads":402,"tags":["task_categories:automatic-speech-recognition","language:vi","license:mit","size_categories:1K<n<10K","format:parquet","format:optimized-parquet","modality:audio","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2404.05659","region:us","medical"],"createdAt":"2023-10-27T23:14:27.000Z","key":""},{"_id":"6540c7be0a2101c338ebc536","id":"jinaai/ger_da_lir","author":"jinaai","disabled":false,"gated":false,"lastModified":"2023-11-02T14:29:36.000Z","likes":1,"trendingScore":1,"private":false,"sha":"0bb47f1d73827e96964edb84dfe552f62f4fd5eb","description":"A legal retrieval dataset in German\nhttps://github.com/lavis-nlp/GerDaLIR","downloads":22,"tags":["region:us"],"createdAt":"2023-10-31T09:24:14.000Z","key":""},{"_id":"6544e36d73b6596c0b1c58c2","id":"SoAp9035/turkish_instructions","author":"SoAp9035","disabled":false,"gated":false,"lastModified":"2025-01-31T17:26:57.000Z","likes":8,"trendingScore":1,"private":false,"sha":"65ce242c0aafdff45d3225ef27283906f35b635c","description":"\n\t\n\t\t\n\t\tTurkish Instructions\n\t\n\n\n\t\n\t\t\n\t\tApache 2.0\n\t\n\nPlanning to update this dataset. 
(31.01.2025)\nThis dataset is a cleaned and organized version (for Mistral) of afkfatih/turkishdataset\n","downloads":63,"tags":["language:tr","license:apache-2.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-11-03T12:11:25.000Z","key":""},{"_id":"654818d4407bb19ff5aa2398","id":"mpingale/mental-health-chat-dataset","author":"mpingale","disabled":false,"gated":false,"lastModified":"2023-11-05T22:36:10.000Z","likes":21,"trendingScore":1,"private":false,"sha":"db4f51d5fa8f030b239925f700e3d96118975814","description":"\n\t\n\t\t\n\t\tDataset Card for \"mental-health-chat-dataset\"\n\t\n\nMore Information needed\n","downloads":156,"tags":["size_categories:1K<n<10K","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-11-05T22:36:04.000Z","key":""},{"_id":"6549672908775ce78e461e83","id":"HackerNoon/tech-company-news-data-dump","author":"HackerNoon","disabled":false,"gated":"auto","lastModified":"2024-02-20T00:03:35.000Z","likes":42,"trendingScore":1,"private":false,"sha":"cc6144ccce683dcebb6e63f9a50ca084544af1c0","description":"HackerNoon curated the internet's most cited 7M+ tech company news articles and blog posts about the 3k+ most valuable tech companies in 2022 and 2023. These stories were curated to power HackerNoon.com/Companies, where we update daily news on top technology companies like Microsoft, Google, and HuggingFace. 
Please use this news data freely for your project, and as always anyone is welcome to publish on HackerNoon.\n","downloads":32,"tags":["task_categories:text-classification","task_categories:summarization","language:en","license:mit","size_categories:1M<n<10M","format:csv","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","news","technology news","company news","tech company news","tech news","technology company news","tech company blogs","technology company blogs","hackernoon","hacker noon","news curation","tech news curation","tech company news curation","technology company news curation","tech blog curation","technology blog curation","brave search api","bing news api","hackernoon api","hacker noon api","tech company news api","technology company news api"],"createdAt":"2023-11-06T22:22:33.000Z","key":""},{"_id":"654ba552acfb5e4a81c52b30","id":"materials-toolkits/materials-project","author":"materials-toolkits","disabled":false,"gated":false,"lastModified":"2024-02-07T11:26:10.000Z","likes":4,"trendingScore":1,"private":false,"sha":"c6c45cbc608018c8593e8ebc8b61ba0ee4d892b6","description":"\n\t\n\t\t\n\t\tDataset\n\t\n\nMaterials project (2019 dump)\nThis dataset contains 133420 materials with formation energy per atom.\nProcessed from mp.2019.04.01.json\n\n\t\n\t\t\n\t\tDownload\n\t\n\nDownload link: materials-project.tar.gz\nMD5 checksum c132f3781f32cd17f3a92aa6501b9531\n\n\t\n\t\t\n\t\tContent\n\t\n\nBundled in materials-project.tar.gz.\n\n\t\n\t\t\n\t\tIndex (index.json)\n\t\n\nlist of dict:\n\nindex (int) => index of the structure in data file.\nid (str) => id of Materials Project.\nformula (str) => formula.\nnatoms (int) => number… See the full description on the dataset page: 
https://huggingface.co/datasets/materials-toolkits/materials-project.","downloads":80,"tags":["license:mit","size_categories:100K<n<1M","region:us","chemistry"],"createdAt":"2023-11-08T15:12:18.000Z","key":""},{"_id":"654ca4e02772771fb294065e","id":"NghiemAbe/translation-vietnamese-english","author":"NghiemAbe","disabled":false,"gated":false,"lastModified":"2025-09-05T01:52:28.000Z","likes":2,"trendingScore":1,"private":false,"sha":"a0eeff336ee7d6fef23bdbc076018515c568b920","description":"Test data: PhoMT\nTrain data: PhoMT (filter len between 40 to 100)\n","downloads":67,"tags":["task_categories:translation","language:vi","language:en","license:mit","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-11-09T09:22:40.000Z","key":""},{"_id":"654e1de722e6880da5e4e3f2","id":"ajibawa-2023/Python-Code-23k-ShareGPT","author":"ajibawa-2023","disabled":false,"gated":false,"lastModified":"2023-11-11T12:27:43.000Z","likes":42,"trendingScore":1,"private":false,"sha":"3cd90afe6017be6f5ec3648ecd25620177446200","description":"This dataset is in Vicuna/ShareGPT format. There are 23000+ set of conversations. 
Each set having 2 conversations.\nAlong with the Python code detailed explanation is provided.\nThis dataset was generated using GPT-3.5, GPT-4 etc.\n","downloads":446,"tags":["language:en","license:cc-by-nc-nd-4.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-11-10T12:11:19.000Z","key":""},{"_id":"654e32060e90d38e757f2f9a","id":"SuryaKrishna02/aya-telugu-news-articles","author":"SuryaKrishna02","disabled":false,"gated":false,"lastModified":"2024-01-30T05:26:18.000Z","likes":6,"trendingScore":1,"private":false,"sha":"f509e37b01dd96a7248eaba8080e0b8ec81fbebf","description":"\n\t\n\t\t\n\t\tSummary\n\t\n\naya-telugu-news-articles is an open source dataset of instruct-style records generated by webscraping a Telugu news articles website. This was created as part of Aya Open Science Initiative from Cohere For AI.\nThis dataset can be used for any purpose, whether academic or commercial, under the terms of the Apache 2.0 License.\nSupported Tasks:\n\nTraining LLMs\nSynthetic Data Generation\nData Augmentation\n\nLanguages: Telugu Version: 1.0\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Overview… See the full description on the dataset page: 
https://huggingface.co/datasets/SuryaKrishna02/aya-telugu-news-articles.","downloads":367,"tags":["task_categories:text-generation","task_ids:language-modeling","annotations_creators:expert-generated","language_creators:expert-generated","multilinguality:monolingual","source_datasets:original","language:te","license:apache-2.0","size_categories:100K<n<1M","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","newspaper","2018-2023"],"createdAt":"2023-11-10T13:37:10.000Z","key":""},{"_id":"654e82ee552b35816abaf6fa","id":"BangumiBase/attackontitan","author":"BangumiBase","disabled":false,"gated":false,"lastModified":"2024-03-20T10:01:38.000Z","likes":1,"trendingScore":1,"private":false,"sha":"16baabfb241a32fd8ad440bce952c5412d296d24","description":"\n\t\n\t\t\n\t\tBangumi Image Base of Attack On Titan\n\t\n\nThis is the image base of bangumi Attack On Titan, we detected 76 characters, 14308 images in total. The full dataset is here.\nPlease note that these image bases are not guaranteed to be 100% cleaned, they may be noisy actual. 
If you intend to manually train models using this dataset, we recommend performing necessary preprocessing on the downloaded dataset to eliminate potential noisy samples (approximately 1% probability).\nHere is the… See the full description on the dataset page: https://huggingface.co/datasets/BangumiBase/attackontitan.","downloads":1445,"tags":["license:mit","size_categories:10K<n<100K","modality:image","region:us","art"],"createdAt":"2023-11-10T19:22:22.000Z","key":""},{"_id":"6550f49cab399279803bc07f","id":"teknium/dataforge-economics","author":"teknium","disabled":false,"gated":false,"lastModified":"2023-11-12T23:39:30.000Z","likes":51,"trendingScore":1,"private":false,"sha":"b4d62f5a7d4e20d7b2a887c2d7d49772e08ca18c","description":"\n\n\t\n\t\t\n\t\tDataset Card for dataforge-economics\n\t\n\n\n\t\n\t\t\n\t\tOverview\n\t\n\nThis dataset, teknium/dataforge-economics, is a specialized collection of 1,000 synthetic examples in the field of economics. It has been generated using OpenAI's GPT-4 and a custom data synthesis pipeline named DataForge, developed by me.\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\n\n\t\n\t\t\n\t\tData Collection and Synthesis\n\t\n\nThe data in teknium/dataforge-economics has been synthetically generated using OpenAI's GPT-4 language model. 
The… See the full description on the dataset page: https://huggingface.co/datasets/teknium/dataforge-economics.","downloads":61,"tags":["language:eng","license:mit","size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","economics"],"createdAt":"2023-11-12T15:51:56.000Z","key":""},{"_id":"65547216f87d7896a9b92a4e","id":"guangyil/laion-coco-aesthetic","author":"guangyil","disabled":false,"gated":false,"lastModified":"2023-11-15T10:34:11.000Z","likes":41,"trendingScore":1,"private":false,"sha":"e4e44b67a4690d6de2dd4e92f8e00c4483d3176e","description":"\n\t\n\t\t\n\t\tLAION COCO with aesthetic score and watermark score\n\t\n\nThis dataset contains 10% samples of the LAION-COCO dataset filtered by some text rules (remove url, special tokens, etc.), and image rules (image size > 384x384, aesthetic score>4.75 and watermark probability<0.5). There are total 8,563,753 data instances in this dataset. And the corresponding aesthetic score and watermark score are also included. 
\nNoted: watermark score in the table means the probability of the existence of the… See the full description on the dataset page: https://huggingface.co/datasets/guangyil/laion-coco-aesthetic.","downloads":224,"tags":["task_categories:image-to-text","task_categories:text-to-image","language:en","license:apache-2.0","size_categories:1M<n<10M","format:parquet","modality:image","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","laion"],"createdAt":"2023-11-15T07:24:06.000Z","key":""},{"_id":"6555742d91e52f4231623114","id":"fimu-docproc-research/CIVQA-TesseractOCR-LayoutLM","author":"fimu-docproc-research","disabled":false,"gated":false,"lastModified":"2023-11-21T20:45:53.000Z","likes":1,"trendingScore":1,"private":false,"sha":"1ab3d45854ae176f7cbedd1ac29653f437cee051","description":"\n\t\n\t\t\n\t\tCIVQA TesseractOCR LayoutLM Dataset\n\t\n\nThe Czech Invoice Visual Question Answering dataset was created with Tesseract OCR and encoded for the LayoutLM. \nThe pre-encoded dataset can be found on this link: https://huggingface.co/datasets/fimu-docproc-research/CIVQA-TesseractOCR\nAll invoices used in this dataset were obtained from public sources. 
Over these invoices, we were focusing on 15 different entities, which are crucial for processing the invoices.\n\nInvoice number\nVariable symbol… See the full description on the dataset page: https://huggingface.co/datasets/fimu-docproc-research/CIVQA-TesseractOCR-LayoutLM.","downloads":154,"tags":["language:cs","license:mit","region:us","finance"],"createdAt":"2023-11-16T01:45:17.000Z","key":""},{"_id":"65561b5bae085c2ba7a6e04f","id":"PatronusAI/financebench","author":"PatronusAI","disabled":false,"gated":false,"lastModified":"2024-11-17T18:42:59.000Z","likes":134,"trendingScore":1,"private":false,"sha":"e04404e3a97f69f79c14d42f24981a1c9c3bcd18","description":"FinanceBench is a first-of-its-kind test suite for evaluating the performance of LLMs on open book financial question answering (QA). This is an open source sample of 150 annotated examples used in the evaluation and analysis of models assessed in the FinanceBench paper.\nThe PDFs linked in the dataset can be found here as well: https://github.com/patronus-ai/financebench/tree/main/pdfs\nThe dataset comprises of questions about publicly traded companies, with corresponding answers and evidence… See the full description on the dataset page: https://huggingface.co/datasets/PatronusAI/financebench.","downloads":4194,"tags":["license:cc-by-nc-4.0","size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2311.11944","region:us"],"createdAt":"2023-11-16T13:38:35.000Z","key":""},{"_id":"6558bfe9c68499a1b941e9c7","id":"Moemu/Muice-Dataset","author":"Moemu","disabled":false,"gated":false,"lastModified":"2026-05-18T13:38:41.000Z","likes":58,"trendingScore":1,"private":false,"sha":"5d9886d7d6e053e0b28e545a0c536c64ca5682c7","description":"\n  \n  Muice-Dataset\n  沐雪角色扮演训练集\n\n\n  🤖ModelScope|\n  🤗HuggingFace|\n  (Github)Muicebot\n\n\n\n\t\n\t\t\n\t\t更新日志\n\t\n\n2026.05.18: 因为作者的论文使用到了本训练集需要引用，故更新 DOI 引用\n2026.02.05: 
小型更新，此次更新过后不再有新的数据集产生。\n2025.08.23: 完整开源所有训练集以作研究用途，大幅更新自述文件\n2025.02.14: 更新测试集以便透明化测试流程\n2025.01.29: 新年快乐！为了感谢大家对沐雪训练集的喜欢，我们重写了训练集并额外提供 500 条训练集给大家。你可以在 这里 查看训练集重写目的和具体内容。除此之外，我们用 Sharegpt 格式规范了训练集格式，现在应该不会那么容易报错了...我们期望大家合理使用我们的训练集并训练出更高质量的模型，祝各位生活愉快。\n\n\t\n\t\t\n\t\t简介… See the full description on the dataset page: https://huggingface.co/datasets/Moemu/Muice-Dataset.","downloads":135,"tags":["task_categories:question-answering","task_categories:text-generation","language:zh","license:cc-by-nc-4.0","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","doi:10.57967/hf/8779","region:us","ACGN"],"createdAt":"2023-11-18T13:45:13.000Z","key":""},{"_id":"655a3a129e9dfc1089096645","id":"neuralwork/fashion-style-instruct","author":"neuralwork","disabled":false,"gated":false,"lastModified":"2023-11-19T17:06:10.000Z","likes":30,"trendingScore":1,"private":false,"sha":"fecb652ca568905befa133e2102a72a05b7f6eac","description":"\n\t\n\t\t\n\t\tStyle Chatbot Dataset\n\t\n\nA style recommendation dataset that contains input (body type and personal clothing style), context (event context) and response triplets. The responses are GPT 3.5 generated outfit combination recommendations given the input body type and personal style prompt and the target / context event.\nOur dataset contains a variety of events such as business functions, cocktail parties, casual gatherings, fancy dates, etc. 
See an example Mistral-based finetuned model.\n","downloads":210,"tags":["task_categories:text-generation","language:en","license:mit","size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-11-19T16:38:42.000Z","key":""},{"_id":"655e6153203bce21fee636c3","id":"cointegrated/taiga_stripped_proza","author":"cointegrated","disabled":false,"gated":false,"lastModified":"2023-11-23T09:48:30.000Z","likes":1,"trendingScore":1,"private":false,"sha":"d8fca3c30e809c9c388407fa2fa24abd76c5fb9a","description":"\n\t\n\t\t\n\t\tDataset Card for \"taiga_stripped_proza\"\n\t\n\nThis is a subset of the Taiga corpus (https://tatianashavrina.github.io/taiga_site), derived from the proza source (a.k.a. \"Fiction\").\nThe dataset consists of plain texts, without morphological and syntactic annotation or metainformation. Apart from stripping the annotations, the texts were not modified.\nFor more details and analysis, and for the texts with annotation or metadata, please refer to website of the corpus.\nOther subsets of Taiga:… See the full description on the dataset page: https://huggingface.co/datasets/cointegrated/taiga_stripped_proza.","downloads":983,"tags":["task_categories:text-generation","task_categories:fill-mask","language:ru","license:cc-by-sa-3.0","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","taiga","tayga"],"createdAt":"2023-11-22T20:15:15.000Z","key":""},{"_id":"6561eb27f725fc09722d2e76","id":"ylacombe/english_dialects","author":"ylacombe","disabled":false,"gated":false,"lastModified":"2023-11-27T10:32:58.000Z","likes":36,"trendingScore":1,"private":false,"sha":"ed3d69abb765fd304ccdcc646ec6ca3c49740d15","description":"\n\t\n\t\t\n\t\tDataset Card for \"english_dialects\"\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThis dataset consists of 31 hours of transcribed 
high-quality audio of English sentences recorded by 120 volunteers speaking with different accents of the British Isles. The dataset is intended for linguistic analysis as well as use for speech technologies. The speakers self-identified as native speakers of Southern England, Midlands, Northern England, Welsh, Scottish and Irish varieties of English.\nThe recording scripts… See the full description on the dataset page: https://huggingface.co/datasets/ylacombe/english_dialects.","downloads":573,"tags":["task_categories:text-to-speech","task_categories:text-to-audio","language:en","license:cc-by-sa-4.0","size_categories:10K<n<100K","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-11-25T12:40:07.000Z","key":""},{"_id":"6564d741cfdc8b6433bfba49","id":"MMMU/MMMU","author":"MMMU","disabled":false,"gated":false,"lastModified":"2026-04-21T18:36:31.000Z","likes":330,"trendingScore":1,"private":false,"sha":"4619a102cf5ad2da1abf7e220fde1258d2434cb7","description":"\n\t\n\t\t\n\t\tMMMU (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI)\n\t\n\n🌐 Homepage | 🏆 Leaderboard | 🤗 Dataset | 🤗 Paper | 📖 arXiv | GitHub\n\n\t\n\t\t\n\t\n\t\n\t\t🔔News\n\t\n\n\n🛠️[2026-04-21]: Fixed option issue in test_Psychology_15.\n‼️[2026-02-12]: We have released the answers for the test set! You can now evaluate your models on the test set locally! 
🎉\n🛠️[2024-05-30]: Fixed duplicate option issues in Materials dataset items (validation_Materials_25;… See the full description on the dataset page: https://huggingface.co/datasets/MMMU/MMMU.","downloads":58835,"tags":["task_categories:question-answering","task_categories:visual-question-answering","task_categories:multiple-choice","language:en","license:apache-2.0","size_categories:10K<n<100K","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2311.16502","region:us","biology","medical","finance","chemistry","music","art","art_theory","design","business","accounting","economics","manage","marketing","health","medicine","basic_medical_science","clinical","pharmacy","public_health","humanities","social_science","history","literature","sociology","psychology","science","geography","math","physics","engineering","agriculture","architecture","computer_science","electronics","energy_and_power","materials","mechanical_engineering"],"createdAt":"2023-11-27T17:52:01.000Z","key":""},{"_id":"656d7a05d848a6683a0c5c75","id":"m-a-p/COIG-CQIA","author":"m-a-p","disabled":false,"gated":false,"lastModified":"2024-04-18T12:10:58.000Z","likes":740,"trendingScore":1,"private":false,"sha":"8b55868c6168adf86c30e7ca0f782cca1c514297","description":"\n    \n      \n    \n\n\n\n\t\n\t\t\n\t\tCOIG-CQIA：Quality is All you need for Chinese Instruction Fine-tuning\n\t\n\n\n\n\n\n\t\n\t\t\n\t\tDataset Details\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\n\n欢迎来到COIG-CQIA，COIG-CQIA全称为Chinese Open Instruction Generalist - Quality is All You Need， 是一个开源的高质量指令微调数据集，旨在为中文NLP社区提供高质量且符合人类交互行为的指令微调数据。COIG-CQIA以中文互联网获取到的问答及文章作为原始数据，经过深度清洗、重构及人工审核构建而成。本项目受LIMA: Less Is More for Alignment等研究启发，使用少量高质量的数据即可让大语言模型学习到人类交互行为，因此在数据构建中我们十分注重数据的来源、质量与多样性，数据集详情请见数据介绍以及我们接下来的论文。\nWelcome to the COIG-CQIA… See the full description on the dataset page: 
https://huggingface.co/datasets/m-a-p/COIG-CQIA.","downloads":19184,"tags":["task_categories:question-answering","task_categories:text-classification","task_categories:text-generation","language:zh","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2403.18058","arxiv:2304.07987","arxiv:2307.09705","region:us"],"createdAt":"2023-12-04T07:04:37.000Z","key":""},{"_id":"656f0a5808bd4deb79ae9d48","id":"blanchon/EuroSat","author":"blanchon","disabled":false,"gated":false,"lastModified":"2023-12-05T11:33:01.000Z","likes":1,"trendingScore":1,"private":false,"sha":"3625b29f4a8b71bcedca09ca7498f3d1fcb45124","downloads":273,"tags":["size_categories:10K<n<100K","format:parquet","modality:image","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-12-05T11:32:40.000Z","key":""},{"_id":"6579ed3f3abcb27a1c7ed8f9","id":"isp-uv-es/WorldFloodsv2","author":"isp-uv-es","disabled":false,"gated":false,"lastModified":"2025-07-31T16:06:52.000Z","likes":8,"trendingScore":1,"private":false,"sha":"1f3faa2989e69930ac31d5c0a79fd224461123f8","description":"\n\t\n\t\t\n\t\tWorldFloodsv2 dataset\n\t\n\nThis repository contains the WorldFloodsv2 dataset released with the publication:\n\nE. Portalés-Julià, G. Mateo-García, C. Purcell, and L. Gómez-Chova Global flood extent segmentation in optical satellite images. Scientific Reports 13, 20316 (2023). DOI: 10.1038/s41598-023-47595-7.\n\nThe WorldFloodsv2 database contains 509 pairs of Sentinel-2 images and flood segmentation masks. 
Split into train, val and test sets.\nIt requires approximately 76GB of hard-disk… See the full description on the dataset page: https://huggingface.co/datasets/isp-uv-es/WorldFloodsv2.","downloads":13255,"tags":["license:cc-by-nc-4.0","modality:geospatial","modality:image","doi:10.57967/hf/3149","region:us","remote sensing","sentinel2","landsat","floods"],"createdAt":"2023-12-13T17:43:27.000Z","key":""},{"_id":"657b4c786c149b6a964c77c1","id":"pixparse/idl-wds","author":"pixparse","disabled":false,"gated":false,"lastModified":"2024-03-29T17:04:45.000Z","likes":194,"trendingScore":1,"private":false,"sha":"e70d43a77ae233778613acf860df8b2d91e0673b","description":"\n\t\n\t\t\n\t\tDataset Card for Industry Documents Library (IDL)\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nIndustry Documents Library (IDL) is a document dataset filtered from UCSF documents library with 19 million pages kept as valid samples.\nEach document exists as a collection of a pdf, a tiff image with the same contents rendered, a json file containing extensive Textract OCR annotations from the idl_data project, and a .ocr file with the original, older OCR annotation. In each pdf, there may be from 1 to up… See the full description on the dataset page: https://huggingface.co/datasets/pixparse/idl-wds.","downloads":7586,"tags":["task_categories:image-to-text","license:other","size_categories:1M<n<10M","format:webdataset","modality:image","modality:text","library:datasets","library:webdataset","library:mlcroissant","region:us"],"createdAt":"2023-12-14T18:42:00.000Z","key":""},{"_id":"657c2e8a8d360b690d6755f2","id":"hezarai/parsynth-ocr-200k","author":"hezarai","disabled":false,"gated":false,"lastModified":"2024-05-07T08:55:33.000Z","likes":23,"trendingScore":1,"private":false,"sha":"e55cca662250585aa0278303901ef30b9b0612f9","description":"ParsynthOCR is a synthetic dataset for Persian OCR. 
This version is a preview of the original 4 million samples dataset (ParsynthOCR-4M).\n\n\t\n\t\t\n\t\tUsage\n\t\n\n\n\t\n\t\t\n\t\t🤗 Datasets\n\t\n\nfrom datasets import load_dataset\n\ndataset = load_dataset(\"hezarai/parsynth-ocr-200k\")\n\n\n\t\n\t\t\n\t\tHezar\n\t\n\npip install hezar\n\nfrom hezar.data import Dataset\n\ndataset = Dataset.load(\"hezarai/parsynth-ocr-200k\", split=\"train\")\n\n","downloads":486,"tags":["task_categories:image-to-image","language:fa","size_categories:100K<n<1M","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","hezar"],"createdAt":"2023-12-15T10:46:34.000Z","key":""},{"_id":"657f09ec3e3b5bea663e53c3","id":"banned-historical-archives/banned-historical-archives","author":"banned-historical-archives","disabled":false,"gated":false,"lastModified":"2025-10-19T15:21:40.000Z","likes":48,"trendingScore":1,"private":false,"sha":"7ee825405a889ac77b0d404fac868f9813da0c83","description":"\n\t\n\t\t\n\t\t和谐历史档案馆数据集 - Banned Historical Archives Datasets\n\t\n\n和谐历史档案馆数据集包含已录入 https://banned-historical-archives.github.io 和暂未未录入的原始文件。\n\n\t\n\t\t\n\t\t目录结构\n\t\n\n\nbanned-historical-archives.github.io # 已录入该网站的原始数据，不定期从 github 仓库中同步\nraw # 原始文件\nconfig # 配置文件\n\n\ntodo # 存放暂未录入网站的文件\n\n部分报纸和图片资料存放在单独的仓库:\n\n\t\n\t\t\n名称\n地址\n状态\n\n\n\t\t\n参考消息\nhttps://huggingface.co/datasets/banned-historical-archives/ckxx\n未录入\n\n\n人民日报\nhttps://huggingface.co/datasets/banned-historical-archives/rmrb\n已精选重要的文章录入\n\n\n文汇报… See the full description on the dataset page: 
https://huggingface.co/datasets/banned-historical-archives/banned-historical-archives.","downloads":1508429,"tags":["size_categories:n<1K","format:imagefolder","modality:image","library:datasets","library:mlcroissant","region:us"],"createdAt":"2023-12-17T14:47:08.000Z","key":""},{"_id":"657f287eaf1698aaac668af5","id":"Gourieff/ReActor","author":"Gourieff","disabled":false,"gated":false,"lastModified":"2026-05-05T09:58:59.000Z","likes":297,"trendingScore":1,"private":false,"sha":"218401e38fb00476b38f319c043b2f85d4db5b87","description":"\n\t\n\t\t\n\t\tReActor Assets\n\t\n\nThe Fast and Simple Face Swap Extension\nComfyUI-ReActor (ex. comfyui-reactor-node) \nsd-webui-reactor\n\n\t\n\t\t\n\t\tModels\n\t\n\n\n\t\n\t\t\nfile\nsource\nlicense\n\n\n\t\t\nbuffalo_l.zip\nDeepInsight\n\n\n\ncodeformer-v0.1.0.pth\nsczhou\n\nGFPGANv1.3.pth\nTencentARC\n\n\n\nGFPGANv1.4.pth\nTencentARC\n\n\n\nGPEN-BFR-512.onnx\nharisreedhar\n\n\n\nRestoreFormer_PP.onnx\nnetrunner.exe\n\n\n\ninswapper_128.onnx\nDeepInsight\n\n\ninswapper_128_fp16.onnx\nHillobar\n\n\n\n\t\n\n","downloads":138118,"tags":["license:mit","region:us"],"createdAt":"2023-12-17T16:57:34.000Z","key":""},{"_id":"65804cb118c6e72a9560e363","id":"Ritvik19/Sudoku-Dataset","author":"Ritvik19","disabled":false,"gated":false,"lastModified":"2024-03-22T04:30:08.000Z","likes":13,"trendingScore":1,"private":false,"sha":"4d5aa527a9fb9aacca0b0d5b8b77d569fa9afcaa","description":"\n\t\n\t\t\n\t\tDataset Card: Sudoku Dataset\n\t\n\n\n\t\n\t\t\n\t\tDataset Overview\n\t\n\n\nDataset Sources:\n\n1 million Sudoku games\n3 million Sudoku puzzles with ratings\n4 Million Sudoku Puzzles Easy-to-Hard\n9 Million Sudoku Puzzles and Solutions\n~2k miscellaneous scraped puzzles\n\n\nDataset Size: 17M puzzles (16.7M for training, 300K for evaluation)\n\n\n\n\t\n\t\n\t\n\t\tData Format\n\t\n\n\nFile Format: Parquet files\nDataset Split:\ntrain: 16.9M puzzles\nvalid: 50K puzzles\n\n\n\n\n\t\n\t\t\n\t\tDataset Attributes\n\t\n\nThe dataset contains the… 
See the full description on the dataset page: https://huggingface.co/datasets/Ritvik19/Sudoku-Dataset.","downloads":912,"tags":["license:apache-2.0","size_categories:10M<n<100M","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-12-18T13:44:17.000Z","key":""},{"_id":"65830e19195e913cb6277b2c","id":"Lakera/mosscap_prompt_injection","author":"Lakera","disabled":false,"gated":false,"lastModified":"2025-02-28T07:59:09.000Z","likes":20,"trendingScore":1,"private":false,"sha":"b7e495ff63373ff7f7dabc1e9390cf62b5838570","description":"\n\t\n\t\t\n\t\tmosscap_prompt_injection\n\t\n\n\n\nThis is a dataset of prompt injections submitted to the game Mosscap by Lakera.\nThis variant of the game Gandalf was created for DEF CON 31.\nNote that the Mosscap levels may no longer be available in the future.\nNote that we release every prompt that we received, regardless of whether it truly is a prompt injection or not.\nThere are hundreds of thousands of prompts and many of them are not actual prompt injections (people ask Mosscap all kinds of things).… See the full description on the dataset page: https://huggingface.co/datasets/Lakera/mosscap_prompt_injection.","downloads":773,"tags":["license:mit","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2501.07927","region:us"],"createdAt":"2023-12-20T15:54:01.000Z","key":""},{"_id":"658570e3bae0736365b32de4","id":"google/IFEval","author":"google","disabled":false,"gated":false,"lastModified":"2024-08-14T08:21:56.000Z","likes":154,"trendingScore":1,"private":false,"sha":"966cd89545d6b6acfd7638bc708b98261ca58e84","description":"\n\t\n\t\t\n\t\tDataset Card for IFEval\n\t\n\n\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThis dataset contains the prompts used in the Instruction-Following Eval (IFEval) benchmark for large language models. 
It contains around 500 \"verifiable instructions\" such as \"write in more than 400 words\" and \"mention the keyword of AI at least 3 times\" which can be verified by heuristics. To load the dataset, run:\nfrom datasets import load_dataset\n\nifeval = load_dataset(\"google/IFEval\")\n\n\n\t\n\t\t\n\t\n\t\n\t\tSupported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/google/IFEval.","downloads":86351,"tags":["task_categories:text-generation","language:en","license:apache-2.0","size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2311.07911","region:us"],"createdAt":"2023-12-22T11:20:03.000Z","key":""},{"_id":"6587ff94509bcae23f71024d","id":"OpenAssistant/oasst2","author":"OpenAssistant","disabled":false,"gated":false,"lastModified":"2024-01-11T06:09:29.000Z","likes":294,"trendingScore":1,"private":false,"sha":"179dd21fc55192153d94adb0e0ce8f69e222bf75","description":"\n\t\n\t\t\n\t\tOpen Assistant Conversations Dataset Release 2 (OASST2)\n\t\n\n\n\t\n\t\t\n\t\tDataset Structure\n\t\n\nThis dataset contains message trees. Each message tree has an initial prompt message as the root node, \nwhich can have multiple child messages as replies, and these child messages can have multiple replies. \nAll messages have a role property: this can either be \"assistant\" or \"prompter\". 
The roles in \nconversation threads from prompt to leaf node strictly alternate between \"prompter\" and… See the full description on the dataset page: https://huggingface.co/datasets/OpenAssistant/oasst2.","downloads":6424,"tags":["language:en","language:es","language:ru","language:de","language:pl","language:th","language:vi","language:sv","language:bn","language:da","language:he","language:it","language:fa","language:sk","language:id","language:nb","language:el","language:nl","language:hu","language:eu","language:zh","language:eo","language:ja","language:ca","language:cs","language:bg","language:fi","language:pt","language:tr","language:ro","language:ar","language:uk","language:gl","language:fr","language:ko","license:apache-2.0","size_categories:100K<n<1M","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2304.07327","region:us","human-feedback"],"createdAt":"2023-12-24T09:53:24.000Z","key":""},{"_id":"658816d17959448ef50348c0","id":"RealTimeData/bbc_news_alltime","author":"RealTimeData","disabled":false,"gated":false,"lastModified":"2025-06-28T02:44:40.000Z","likes":51,"trendingScore":1,"private":false,"sha":"8dd1ecdc92ac43f9c04a3da3e945537dbb08179b","description":"\n\t\n\t\t\n\t\tRealTimeData Monthly Collection - BBC News\n\t\n\nThis datasets contains all news articles from BBC News that were created every months from 2017 to current.\nTo access articles in a specific month, simple run the following:\nds = datasets.load_dataset('RealTimeData/bbc_news_alltime', '2020-02')\n\nThis will give you all BBC news articles that were created in 2020-02.\n\n\t\n\t\t\n\t\n\t\n\t\tWant to crawl the data by your own?\n\t\n\nPlease head to LatestEval for the crawler scripts.\n\n\t\n\t\t\n\t\n\t\n\t\tCredit… See the full description on the dataset page: 
https://huggingface.co/datasets/RealTimeData/bbc_news_alltime.","downloads":2850,"tags":["size_categories:100K<n<1M","format:parquet","modality:image","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-12-24T11:32:33.000Z","key":""},{"_id":"658c52715a8f8a309e8e0c12","id":"Roudranil/shakespearean-and-modern-english-conversational-dataset","author":"Roudranil","disabled":false,"gated":false,"lastModified":"2025-05-17T15:42:50.000Z","likes":4,"trendingScore":1,"private":false,"sha":"c6eabff9b121491322f1ad125e975b421e632059","description":"\n\t\n\t\t\n\t\tDataset Card for Shakespearean and Modern English Conversational Dataset\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThis dataset contains dialog pairs taken from Shakespeare's works - the first dialog is a translated text in modern english, and the second dialog is it's actual response as written in Shakespeare's plays. See the github repo for more details.\n","downloads":170,"tags":["language:en","size_categories:1K<n<10K","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","fine-tuning","shakespeare"],"createdAt":"2023-12-27T16:36:01.000Z","key":""},{"_id":"65918cec5b7553ca5ca2d706","id":"SebastianBodza/Quora_deutsch_ger_Pairs_RL_DPO","author":"SebastianBodza","disabled":false,"gated":"auto","lastModified":"2024-01-05T19:22:02.000Z","likes":1,"trendingScore":1,"private":false,"sha":"f81129b4c2a5453b6c037e0571a52b701da8f6b8","downloads":18,"tags":["size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2023-12-31T15:46:52.000Z","key":""},{"_id":"65937b9d89145cbc7c4519ab","id":"CausalLM/GPT-4-Self-Instruct-Turkish","author":"CausalLM","disabled":false,"gated":false,"lastModified":"2025-02-11T14:17:00.000Z","likes":24,"trendingScore":1,"private":false,"sha":"4a91276053fe389c20d30fbe
0446d34d877fe291","description":"As per the community's request, here we share a Turkish dataset synthesized using the OpenAI GPT-4 model with Self-Instruct, utilizing some excess Azure credits. Please feel free to use it. All questions and answers are newly generated by GPT-4, without specialized verification, only simple filtering and strict semantic similarity control have been applied.\nWe hope that this will be helpful for fine-tuning open-source models for non-English languages, particularly Turkish. This dataset will be… See the full description on the dataset page: https://huggingface.co/datasets/CausalLM/GPT-4-Self-Instruct-Turkish.","downloads":104,"tags":["language:tr","license:cc-by-4.0","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","gpt4"],"createdAt":"2024-01-02T02:57:33.000Z","key":""},{"_id":"65967066c57a761c569fdfcb","id":"nguyenvulebinh/asr-alignment","author":"nguyenvulebinh","disabled":false,"gated":false,"lastModified":"2024-01-08T08:48:13.000Z","likes":5,"trendingScore":1,"private":false,"sha":"d4dfbb6e382fcebe5ef340c38f83fa753d47d7b0","description":"\n\t\n\t\t\n\t\tSpeech Recognition Alignment Dataset\n\t\n\nThis dataset is a variation of several widely-used ASR datasets, encompassing Librispeech, MuST-C, TED-LIUM, VoxPopuli, Common Voice, and GigaSpeech. The difference is this dataset includes:\n\nPrecise alignment between audio and text. 
\nText that has been punctuated and made case-sensitive.\nIdentification of named entities in the text.\n\n\n\t\n\t\t\n\t\tUsage\n\t\n\nFirst, install the latest version of the 🤗 Datasets package:\npip install --upgrade pip\npip… See the full description on the dataset page: https://huggingface.co/datasets/nguyenvulebinh/asr-alignment.","downloads":6897,"tags":["language:en","license:apache-2.0","size_categories:10M<n<100M","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-01-04T08:46:30.000Z","key":""},{"_id":"659d4df2dda11310d8a7a840","id":"reach-vb/jenny_tts_dataset","author":"reach-vb","disabled":false,"gated":false,"lastModified":"2024-01-09T14:11:57.000Z","likes":37,"trendingScore":1,"private":false,"sha":"2fde2659322f4d7d006c3ab7a4a08b2c12595324","description":"\n\t\n\t\t\n\t\tJenny TTS Dataset\n\t\n\nA high-quality, varied ~30hr voice dataset suitable for training a TTS model.\nVoice is recorded by Jenny. 
She's Irish.\nMaterial read include:\n\nNewspaper headlines\nTranscripts of various Youtube videos\nAbout 2/3 of the book '1984'\nSome of the book 'Little Women'\nWikipedia articles, different topics (philosophy, history, science)\nRecipes\nReddit comments\nSong lyrics, including rap lyrics\nTranscripts to the show 'Friends'\n\nAudio files are 48khz, 16-bit PCM files, 2… See the full description on the dataset page: https://huggingface.co/datasets/reach-vb/jenny_tts_dataset.","downloads":364,"tags":["size_categories:10K<n<100K","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-01-09T13:45:22.000Z","key":""},{"_id":"65a01739828e7dfb10a6190d","id":"CohereLabs/wikipedia-2023-11-embed-multilingual-v3","author":"CohereLabs","disabled":false,"gated":false,"lastModified":"2026-03-25T04:05:03.000Z","likes":248,"trendingScore":1,"private":false,"sha":"ade45fb52bd549f5e8c065636fe4160a43c2af36","description":"\n\t\n\t\t\n\t\tMultilingual Embeddings for Wikipedia in 300+ Languages\n\t\n\nThis dataset contains the wikimedia/wikipedia dataset dump from 2023-11-01 from Wikipedia in all 300+ languages. \nThe individual articles have been chunked and embedded with the state-of-the-art multilingual Cohere Embed V3 embedding model. This enables an easy way to semantically search across all of Wikipedia or to use it as a knowledge source for your RAG application. 
In total it is close to 250M paragraphs / embeddings.\nYou… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabs/wikipedia-2023-11-embed-multilingual-v3.","downloads":10864,"tags":["size_categories:100M<n<1B","modality:text","region:us"],"createdAt":"2024-01-11T16:28:41.000Z","key":""},{"_id":"65a12e81eeb6095d954bb7fe","id":"Teklia/IAM-line","author":"Teklia","disabled":false,"gated":false,"lastModified":"2024-03-14T16:19:29.000Z","likes":25,"trendingScore":1,"private":false,"sha":"fbdad97500ce54635c0d1ba306bf535cb40656cf","description":"\n\t\n\t\t\n\t\tIAM - line level\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThe IAM Handwriting Database contains forms of handwritten English text which can be used to train and test handwritten text recognizers and to perform writer identification and verification experiments.\nNote that all images are resized to a fixed height of 128 pixels.\n\n\t\n\t\t\n\t\tLanguages\n\t\n\nAll the documents in the dataset are written in English.\n\n\t\n\t\t\n\t\tDataset Structure\n\t\n\n\n\t\n\t\t\n\t\tData Instances\n\t\n\n{\n  'image':… See the full description on the dataset page: https://huggingface.co/datasets/Teklia/IAM-line.","downloads":2857,"tags":["task_categories:image-to-text","language:en","license:mit","size_categories:10K<n<100K","format:parquet","modality:image","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","atr","htr","ocr","modern","handwritten"],"createdAt":"2024-01-12T12:20:17.000Z","key":""},{"_id":"65a8647f3212568deff4a5ec","id":"zefang-liu/phishing-email-dataset","author":"zefang-liu","disabled":false,"gated":false,"lastModified":"2024-01-17T23:48:20.000Z","likes":35,"trendingScore":1,"private":false,"sha":"34085a032c123ca237f314a01a67909cdea35e34","description":"\n\t\n\t\t\n\t\tPhishing Email Dataset\n\t\n\nThis dataset on Hugging Face is a direct copy of the 'Phishing Email Detection' dataset from Kaggle, shared under the 
GNU Lesser General Public License 3.0. The dataset was originally created by the user 'Cyber Cop' on Kaggle.  For complete details, including licensing and usage information, please visit the original Kaggle page.\n","downloads":643,"tags":["task_categories:text-classification","language:en","license:lgpl-3.0","size_categories:10K<n<100K","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-01-17T23:36:31.000Z","key":""},{"_id":"65aae36330225893713e2e22","id":"Zainabsa99/attack","author":"Zainabsa99","disabled":false,"gated":false,"lastModified":"2024-02-25T11:56:26.000Z","likes":1,"trendingScore":1,"private":false,"sha":"88486f2088a619338bd959aa0f8aa64693571121","downloads":20,"tags":["size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-01-19T21:02:27.000Z","key":""},{"_id":"65b47b3b2524ac36e7c72e93","id":"lmms-lab/COCO-Caption","author":"lmms-lab","disabled":false,"gated":false,"lastModified":"2024-03-08T03:18:01.000Z","likes":14,"trendingScore":1,"private":false,"sha":"c3213b54e173ae86517d481de17f31439d6f8c72","description":"\n\n\n\n\n\t\n\t\t\n\t\tLarge-scale Multi-modality Models Evaluation Suite\n\t\n\n\nAccelerating the development of large-scale multi-modality models (LMMs) with lmms-eval\n\n🏠 Homepage | 📚 Documentation | 🤗 Huggingface Datasets\n\n\t\n\t\t\n\t\n\t\n\t\tThis Dataset\n\t\n\nThis is a formatted version of COCO-Caption-2014-version. 
It is used in our lmms-eval pipeline to allow for one-click evaluations of large multi-modality models.\n@misc{lin2015microsoft,\n      title={Microsoft COCO: Common Objects in Context}… See the full description on the dataset page: https://huggingface.co/datasets/lmms-lab/COCO-Caption.","downloads":4314,"tags":["size_categories:10K<n<100K","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:1405.0312","region:us"],"createdAt":"2024-01-27T03:40:43.000Z","key":""},{"_id":"65b4896e44c5e98bda8a8fab","id":"lmms-lab/RefCOCO","author":"lmms-lab","disabled":false,"gated":false,"lastModified":"2024-03-08T03:23:40.000Z","likes":36,"trendingScore":1,"private":false,"sha":"a5dff0b3194715fda69b2d6a0d2aaafb41eaa407","description":"\n\n\n\n\n\t\n\t\t\n\t\tLarge-scale Multi-modality Models Evaluation Suite\n\t\n\n\nAccelerating the development of large-scale multi-modality models (LMMs) with lmms-eval\n\n🏠 Homepage | 📚 Documentation | 🤗 Huggingface Datasets\n\n\t\n\t\t\n\t\n\t\n\t\tThis Dataset\n\t\n\nThis is a formatted version of RefCOCO. 
It is used in our lmms-eval pipeline to allow for one-click evaluations of large multi-modality models.\n@inproceedings{kazemzadeh-etal-2014-referitgame,\n    title = \"{R}efer{I}t{G}ame: Referring to Objects in… See the full description on the dataset page: https://huggingface.co/datasets/lmms-lab/RefCOCO.","downloads":8280,"tags":["size_categories:10K<n<100K","format:parquet","modality:image","modality:text","modality:timeseries","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-01-27T04:41:18.000Z","key":""},{"_id":"65b4a57ea0dc6ea599ebb0cd","id":"smangrul/ultrachat-feedback-10k-chatml","author":"smangrul","disabled":false,"gated":false,"lastModified":"2024-01-27T06:41:06.000Z","likes":1,"trendingScore":1,"private":false,"sha":"1059501265b4c9a21b3b446f160f9e57565fd472","downloads":16,"tags":["size_categories:10K<n<100K","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-01-27T06:41:02.000Z","key":""},{"_id":"65b7ac587817e067a5affd59","id":"Witcape/logo_finetune","author":"Witcape","disabled":false,"gated":false,"lastModified":"2024-02-02T11:29:00.000Z","likes":1,"trendingScore":1,"private":false,"sha":"df102c471c3c6b7cd477cb838f58d9b9017daa46","downloads":16,"tags":["license:mit","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-01-29T13:47:04.000Z","key":""},{"_id":"65ba8e995a5dbabc815cf090","id":"wndknd/german-law-bgb","author":"wndknd","disabled":false,"gated":false,"lastModified":"2024-01-31T18:18:52.000Z","likes":6,"trendingScore":1,"private":false,"sha":"4fb2cd5d7fb16998a359ca35e6497029794ade25","description":"The Bürgerliche Gesetzbuch divided by each paragraph for 
text-generation.\n","downloads":35,"tags":["task_categories:text-generation","language:de","license:mit","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-01-31T18:16:57.000Z","key":""},{"_id":"65ba901ba41e2dbe892a07f7","id":"wndknd/german-law-stgb","author":"wndknd","disabled":false,"gated":false,"lastModified":"2024-01-31T18:23:34.000Z","likes":3,"trendingScore":1,"private":false,"sha":"774516ae584c6c3ca7b6661eb35548b0b8e1debe","downloads":25,"tags":["license:mit","size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-01-31T18:23:23.000Z","key":""},{"_id":"65ba906d3a03c2490a40a251","id":"wndknd/german-law-gg","author":"wndknd","disabled":false,"gated":false,"lastModified":"2024-01-31T18:24:57.000Z","likes":2,"trendingScore":1,"private":false,"sha":"9e97a5350ecdadef554fc18d33e5aac2183b5be6","downloads":18,"tags":["size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-01-31T18:24:45.000Z","key":""},{"_id":"65ba90cf145d4d463b42554b","id":"wndknd/german-law-sgb-1","author":"wndknd","disabled":false,"gated":false,"lastModified":"2024-01-31T18:26:34.000Z","likes":2,"trendingScore":1,"private":false,"sha":"2fa7d4aea4d19d72011c49cffb531e25005b12fe","downloads":19,"tags":["license:mit","size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-01-31T18:26:23.000Z","key":""},{"_id":"65babe4066552223511584fb","id":"CohereLabs/aya_dataset","author":"CohereLabs","disabled":false,"gated":false,"lastModified":"2025-04-15T08:51:55.000Z","likes":355,"trendingScore":1,"private":false,"sha":"f9ea04583f02a8f86404ff6c58bf75fe637df8a2","description":"\n\n\t\n\t\t\n\t\tData
set Summary\n\t\n\nThe Aya Dataset is a multilingual instruction fine-tuning dataset curated by an open-science community via Aya Annotation Platform from Cohere Labs. The dataset contains a total of 204k human-annotated prompt-completion pairs along with the demographics data of the annotators.\nThis dataset can be used to train, finetune, and evaluate multilingual LLMs.\n\nCurated by: Contributors of Aya Open Science Intiative.\n\nLanguage(s): 65 languages (71 including dialects &… See the full description on the dataset page: https://huggingface.co/datasets/CohereLabs/aya_dataset.","downloads":7879,"tags":["task_categories:other","annotations_creators:crowdsourced","annotations_creators:expert-generated","language_creators:crowdsourced","language_creators:expert-generated","multilinguality:multilingual","source_datasets:original","language:amh","language:arb","language:ary","language:ars","language:acq","language:arz","language:apc","language:ben","language:ceb","language:dan","language:deu","language:ell","language:eng","language:eus","language:fil","language:fin","language:fra","language:gle","language:guj","language:hat","language:hau","language:hin","language:hun","language:ibo","language:ind","language:ita","language:jav","language:jpn","language:kan","language:kir","language:kor","language:kur","language:lit","language:mal","language:mar","language:mlg","language:msa","language:mya","language:nep","language:nld","language:nso","language:nya","language:pan","language:pes","language:pol","language:por","language:pus","language:rus","language:sin","language:sna","language:snd","language:som","language:spa","language:sqi","language:srp","language:sun","language:swa","language:swe","language:tam","language:tel","language:tha","language:tur","language:ukr","language:urd","language:vie","language:wol","language:xho","language:yor","language:zho","language:zul","license:apache-2.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","lib
rary:pandas","library:mlcroissant","library:polars","arxiv:2402.06619","region:us"],"createdAt":"2024-01-31T21:40:16.000Z","key":""},{"_id":"65badf10185cd2ba00e63ecf","id":"stanfordnlp/colorswap","author":"stanfordnlp","disabled":false,"gated":"auto","lastModified":"2024-02-06T22:23:20.000Z","likes":8,"trendingScore":1,"private":false,"sha":"3e1420f4997c46b871f0436635750f9fc6623459","description":"\n\t\n\t\t\n\t\tColorSwap: A Color and Word Order Dataset for Multimodal Evaluation\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\nColorSwap is a dataset designed to assess and improve the proficiency of multimodal models in matching objects with their colors. The dataset is comprised of 2,000 unique image-caption pairs, grouped into 1,000 examples. Each example includes a caption-image pair, along with a \"color-swapped\" pair. Crucially, the two captions in an example have the same words, but the color words have… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/colorswap.","downloads":13,"tags":["license:mit","size_categories:1K<n<10K","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-02-01T00:00:16.000Z","key":""},{"_id":"65bb837dff80913e34290457","id":"mii-llm/gazzetta-ufficiale","author":"mii-llm","disabled":false,"gated":false,"lastModified":"2024-03-05T07:32:14.000Z","likes":39,"trendingScore":1,"private":false,"sha":"ab335f2f0bdd7bd144f51092e0c603e10068365d","description":"\n\t\n\t\t\n\t\tGazzetta Ufficiale 👩🏻‍⚖️⚖️🏛️📜🇮🇹\n\t\n\n\n\nLa Gazzetta Ufficiale della Repubblica Italiana, quale fonte ufficiale di conoscenza delle norme in vigore in Italia e strumento di diffusione, informazione e ufficializzazione di testi legislativi, atti pubblici e privati, è edita dall’Istituto Poligrafico e Zecca dello Stato e pubblicata in collaborazione con il Ministero della Giustizia, il quale provvede alla direzione e redazione della 
stessa. L'Istituto Poligrafico e Zecca dello Stato… See the full description on the dataset page: https://huggingface.co/datasets/mii-llm/gazzetta-ufficiale.","downloads":451,"tags":["task_categories:text-generation","task_categories:fill-mask","language:it","license:mit","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","doi:10.57967/hf/2708","region:us","law","legal"],"createdAt":"2024-02-01T11:41:49.000Z","key":""},{"_id":"65bbbeaef3c2c31b1f0ce2df","id":"DBD-research-group/BirdSet","author":"DBD-research-group","disabled":false,"gated":false,"lastModified":"2026-05-08T10:07:58.000Z","likes":28,"trendingScore":1,"private":false,"sha":"f1e476b0ba15749927df577f194adca1e5513297","citation":"    @misc{rauch2025birdsetlargescaledatasetaudio,\n          title={BirdSet: A Large-Scale Dataset for Audio Classification in Avian Bioacoustics}, \n          author={Lukas Rauch and Raphael Schwinger and Moritz Wirth and René Heinrich and Denis Huseljic and Marek Herde and Jonas Lange and Stefan Kahl and Bernhard Sick and Sven Tomforde and Christoph Scholz},\n          year={2025},\n          eprint={2403.10380},\n          archivePrefix={arXiv},\n          primaryClass={cs.SD},\n          url={https://arxiv.org/abs/2403.10380}, \n    }","description":"    Deep learning (DL) has greatly advanced audio classification, \n    yet the field is limited by the scarcity of large-scale benchmark datasets that have propelled progress in other domains. \n    While AudioSet is a pivotal step to bridge this gap as a universal-domain dataset, its restricted accessibility and \n    limited range of evaluation use cases challenge its role as the sole resource. Therefore, we introduce BirdSet, \n    a large-scale benchmark dataset for audio classification focusing on avian bioacoustics. 
\n    BirdSet surpasses AudioSet with over 6,800 recording hours (+17%) from nearly 10,000 classes (x18) for training and more \n    than 400 hours (x7) across eight strongly labeled evaluation datasets. It serves as a versatile resource for use \n    cases such as multi-label classification, covariate shift or self-supervised learning. We benchmark six well-known \n    DL models in multi-label classification across three distinct training scenarios and outline further evaluation use \n    cases in audio classification. We host our dataset on Hugging Face for easy accessibility and offer an extensive \n    codebase to reproduce our results.","downloads":4273,"tags":["task_categories:audio-classification","license:cc-by-nc-4.0","arxiv:2403.10380","doi:10.57967/hf/2468","region:us","audio classification","multi-label classification","bird sounds","passive acoustic monitoring"],"createdAt":"2024-02-01T15:54:22.000Z","key":""},{"_id":"65bf982a719492167d8325ca","id":"lamhieu/mabrycodes_dialogue_vi","author":"lamhieu","disabled":false,"gated":false,"lastModified":"2024-05-17T11:06:37.000Z","likes":3,"trendingScore":1,"private":false,"sha":"bff9550edbaa0dbedeb2528993937b6b3cb913c1","description":"\n\t\n\t\t\n\t\tDescription\n\t\n\nThe dataset is from 5CD-AI/Vietnamese-mabryCodes-tiny-cot-alpaca-gg-translated, formatted as dialogues for speed and ease of use. Many thanks to author for releasing it.\nImportantly, this format is easy to use via the default chat template of transformers, meaning you can use huggingface/alignment-handbook immediately, unsloth.\n\n\t\n\t\t\n\t\n\t\n\t\tStructure\n\t\n\nView online through viewer.\n\n\t\n\t\t\n\t\n\t\n\t\tNote\n\t\n\nWe advise you to reconsider before use, thank you. 
If you find it useful… See the full description on the dataset page: https://huggingface.co/datasets/lamhieu/mabrycodes_dialogue_vi.","downloads":54,"tags":["task_categories:text-generation","task_categories:question-answering","language:vi","license:mit","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-02-04T13:59:06.000Z","key":""},{"_id":"65c65ad09f33c948f49b8c07","id":"numind/NuNER","author":"numind","disabled":false,"gated":false,"lastModified":"2024-03-19T17:36:54.000Z","likes":42,"trendingScore":1,"private":false,"sha":"1784de71436044100ab9f153435f8a5c0ea4e1b6","description":"Citation:\n@misc{bogdanov2024nuner,\n      title={NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data}, \n      author={Sergei Bogdanov and Alexandre Constantin and Timothée Bernard and Benoit Crabbé and Etienne Bernard},\n      year={2024},\n      eprint={2402.15343},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n\n","downloads":203,"tags":["language:en","license:mit","size_categories:1M<n<10M","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2402.15343","region:us"],"createdAt":"2024-02-09T17:03:12.000Z","key":""},{"_id":"65c666da10735dcd76ea29e1","id":"ibrahimhamamci/CT-RATE","author":"ibrahimhamamci","disabled":false,"gated":"auto","lastModified":"2026-03-16T14:26:48.000Z","likes":262,"trendingScore":1,"private":false,"sha":"deeca4d89e9f978d4d1bccd88a55071ddbb146bb","description":"\n\n\nThe CT-RATE Team organizes the VLM3D Challenge\n\n\n\nVLM3D 2026 (2nd Edition) → Challenge Finals at MICCAI 2026\nVLM3D 2025 (1st Edition) → Challenge Finals at MICCAI 2025 • Workshop at ICCV 2025\n\n\n\n\n\n\n\nThe CT-RATE Team is developing the MR-RATE Dataset\n\n\n\nA large-scale brain MRI dataset with paired radiology reports for training 3D vision-language models.\n\nGitHub   |  
\nDataset   |  \nMetadata Dashboard\n\n\n\n\n\n\t\n\t\t\n\t\tGeneralist Foundation Models from a Multimodal Dataset for 3D Computed Tomography… See the full description on the dataset page: https://huggingface.co/datasets/ibrahimhamamci/CT-RATE.","downloads":97686,"tags":["task_categories:image-to-text","task_categories:text-to-image","task_categories:image-classification","task_categories:question-answering","task_categories:visual-question-answering","task_categories:zero-shot-classification","language:en","license:cc-by-nc-sa-4.0","size_categories:10K<n<100K","arxiv:2403.17834","region:us","chest-ct","radiology","science","huggingscience","3d-medical-imaging","medical","ct-rate","multimodal","vision-language","healthcare","diagnostic-imaging","computer-vision","foundation-model"],"createdAt":"2024-02-09T17:54:34.000Z","key":""},{"_id":"65c69b0bc718b4df69df3b45","id":"doof-ferb/vlsp2020_vinai_100h","author":"doof-ferb","disabled":false,"gated":false,"lastModified":"2025-04-20T20:19:38.000Z","likes":15,"trendingScore":1,"private":false,"sha":"7d58a85358da9f5208426e8beb843cb2dfa40178","description":"\n\t\n\t\t\n\t\tunofficial mirror of VLSP 2020 - VinAI - ASR challenge dataset\n\t\n\nofficial announcement:\n\ntiếng việt: https://institute.vinbigdata.org/events/vinbigdata-chia-se-100-gio-du-lieu-tieng-noi-cho-cong-dong/\nin eglish: https://institute.vinbigdata.org/en/events/vinbigdata-shares-100-hour-data-for-the-community/\nVLSP 2020 workshop: https://vlsp.org.vn/vlsp2020\n\nofficial download: https://drive.google.com/file/d/1vUSxdORDxk-ePUt-bUVDahpoXiqKchMx/view?usp=sharing\ncontact: info@vinbigdata.org… See the full description on the dataset page: 
https://huggingface.co/datasets/doof-ferb/vlsp2020_vinai_100h.","downloads":636,"tags":["task_categories:automatic-speech-recognition","task_categories:text-to-speech","language:vi","license:cc-by-4.0","size_categories:10K<n<100K","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-02-09T21:37:15.000Z","key":""},{"_id":"65c75fcd522ee3a4ed89d43c","id":"H-Liu1997/BEAT2","author":"H-Liu1997","disabled":false,"gated":false,"lastModified":"2024-02-10T11:50:54.000Z","likes":10,"trendingScore":1,"private":false,"sha":"8689ecb43513ba31964fd60e0ca69be02d3b0872","downloads":15562,"tags":["license:apache-2.0","size_categories:1K<n<10K","format:csv","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-02-10T11:36:45.000Z","key":""},{"_id":"65c87c0dfe3148b91827ecc7","id":"lmms-lab/MP-DocVQA","author":"lmms-lab","disabled":false,"gated":false,"lastModified":"2024-02-11T09:00:53.000Z","likes":6,"trendingScore":1,"private":false,"sha":"e776eeade5a631d9ae18230ed3ab6213cd063dac","downloads":967,"tags":["size_categories:10K<n<100K","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-02-11T07:49:33.000Z","key":""},{"_id":"65c8f5d9699c1964958b49f9","id":"Fmfawaz32/mitre-attack","author":"Fmfawaz32","disabled":false,"gated":false,"lastModified":"2024-02-11T16:30:07.000Z","likes":1,"trendingScore":1,"private":false,"sha":"d7a0ef6e7a533a01b68538078c195ca785aa7959","downloads":21,"tags":["license:apache-2.0","region:us"],"createdAt":"2024-02-11T16:29:13.000Z","key":""},{"_id":"65cba401877f943912b7d0ed","id":"2A2I/Arabic_Aya","author":"2A2I","disabled":false,"gated":false,"lastModified":"2024-03-15T11:08:04.000Z","likes":16,"trendingScore":1,"private":false,"sha":"ff01e536a94848fc8fd56da54f2a79f3e70ac943","de
scription":"\n\t\n\t\t\n\t\tDataset Card for : Arabic Aya (2A)\n\t\n\n\n\n\n\n\n\t\n\t\t\n\t\tArabic Aya (2A) : A Curated Subset of the Aya Collection for Arabic Language Processing\n\t\n\n\n\t\n\t\t\n\t\tDataset Sources & Infos\n\t\n\n\nData Origin: Derived from 69 subsets of the original Aya datasets : CohereForAI/aya_collection, CohereForAI/aya_dataset, and CohereForAI/aya_evaluation_suite.\nLanguages: Modern Standard Arabic (MSA) and a variety of Arabic dialects ( 'arb', 'arz', 'ary', 'ars', 'knc', 'acm', 'apc', 'aeb', 'ajp', 'acq' )… See the full description on the dataset page: https://huggingface.co/datasets/2A2I/Arabic_Aya.","downloads":1228,"tags":["task_categories:text-classification","task_categories:translation","task_categories:summarization","language:ar","license:apache-2.0","size_categories:10M<n<100M","modality:tabular","modality:text","arxiv:2402.06619","region:us"],"createdAt":"2024-02-13T17:16:49.000Z","key":""},{"_id":"65d081682d592af8eafaede7","id":"sayhan/strix-philosophy-qa","author":"sayhan","disabled":false,"gated":false,"lastModified":"2024-03-01T18:17:59.000Z","likes":29,"trendingScore":1,"private":false,"sha":"f5f4502cbe2f74f68c3000f1bafa7b4d945f4d21","description":"\n\t\n\t\t\n\t\tStrix\n\t\n\n\n134k question-answer pairs based on AiresPucrs' stanford-encyclopedia-philosophy dataset.\n","downloads":238,"tags":["task_categories:question-answering","language:en","size_categories:100K<n<1M","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","philosophy"],"createdAt":"2024-02-17T09:50:32.000Z","key":""},{"_id":"65d4ad656ef9b89fef35c6b4","id":"UniqueData/dicom-brain-dataset","author":"UniqueData","disabled":false,"gated":false,"lastModified":"2025-08-29T07:05:54.000Z","likes":6,"trendingScore":1,"private":false,"sha":"7631b3fc8cb5521d4ddee78daeda059b2635e4c9","description":"\n\t\n\t\t\n\t\tBrain MRI Dataset, Normal Brain Dataset, Anomaly Classification & Detection\n\t\n\nThe 
dataset consists of .dcm files containing MRI scans of the brain of the person with a normal brain. The images are labeled by the doctors and accompanied by report in PDF-format. \nThe dataset includes 7 studies, made from the different angles which provide a comprehensive understanding of a normal brain structure and useful in training brain anomaly classification algorithms.\n\n\t\n\t\t\n\t\n\t\n\t\tMRI study angles… See the full description on the dataset page: https://huggingface.co/datasets/UniqueData/dicom-brain-dataset.","downloads":291,"tags":["task_categories:image-to-image","task_categories:image-classification","task_categories:image-segmentation","task_categories:object-detection","language:en","license:cc-by-nc-nd-4.0","size_categories:n<1K","format:imagefolder","modality:image","library:datasets","library:mlcroissant","region:us","medical","biology","code","medical-imaging","mri-dataset","brain-mri"],"createdAt":"2024-02-20T13:47:17.000Z","key":""},{"_id":"65d4e87efb0d0560cfd5c2b9","id":"MedRAG/pubmed","author":"MedRAG","disabled":false,"gated":false,"lastModified":"2024-02-27T05:35:03.000Z","likes":108,"trendingScore":1,"private":false,"sha":"33da3593d5756bc04c8909f170003c0b14197957","description":"\n\t\n\t\t\n\t\tThe PubMed Corpus in MedRAG\n\t\n\nThis HF dataset contains the snippets from the PubMed corpus used in MedRAG. It can be used for medical Retrieval-Augmented Generation (RAG).\n\n\t\n\t\t\n\t\tNews\n\t\n\n\n(02/26/2024) The \"id\" column has been reformatted. A new \"PMID\" column is added.\n\n\n\t\n\t\t\n\t\tDataset Details\n\t\n\n\n\t\n\t\t\n\t\tDataset Descriptions\n\t\n\nPubMed is the most widely used literature resource, containing over 36 million biomedical articles. 
\nFor MedRAG, we use a PubMed subset of 23.9 million… See the full description on the dataset page: https://huggingface.co/datasets/MedRAG/pubmed.","downloads":8747,"tags":["task_categories:question-answering","language:en","size_categories:1M<n<10M","format:json","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2402.13178","region:us","medical","question answering","large language model","retrieval-augmented generation"],"createdAt":"2024-02-20T17:59:26.000Z","key":""},{"_id":"65d5e1212c2151e6208988c4","id":"sarvamai/samvaad-hi-v1","author":"sarvamai","disabled":false,"gated":false,"lastModified":"2024-07-16T06:35:20.000Z","likes":68,"trendingScore":1,"private":false,"sha":"6439910cc9942885e90df1643a9137aecc54aeb5","description":"100k high-quality conversations in English, Hindi, and Hinglish curated exclusively with an Indic context.\n","downloads":114,"tags":["task_categories:text-generation","language:en","language:hi","license:apache-2.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-02-21T11:40:17.000Z","key":""},{"_id":"65da187358ea0eec69e7c5ae","id":"BangumiBase/tokyoghoul","author":"BangumiBase","disabled":false,"gated":false,"lastModified":"2024-03-20T20:14:22.000Z","likes":1,"trendingScore":1,"private":false,"sha":"46bcc5a55ce394a0aef9a5aff5f6c2b748466adc","description":"\n\t\n\t\t\n\t\tBangumi Image Base of Tokyo Ghoul\n\t\n\nThis is the image base of bangumi Tokyo Ghoul, we detected 74 characters, 3651 images in total. The full dataset is here.\nPlease note that these image bases are not guaranteed to be 100% cleaned, they may be noisy actual. 
If you intend to manually train models using this dataset, we recommend performing necessary preprocessing on the downloaded dataset to eliminate potential noisy samples (approximately 1% probability).\nHere is the characters'… See the full description on the dataset page: https://huggingface.co/datasets/BangumiBase/tokyoghoul.","downloads":1598,"tags":["license:mit","size_categories:1K<n<10K","modality:image","modality:text","region:us","art"],"createdAt":"2024-02-24T16:25:23.000Z","key":""},{"_id":"65dc8fd34e23db813de213f0","id":"OpenDFM/MoGUI","author":"OpenDFM","disabled":false,"gated":false,"lastModified":"2025-11-08T12:54:01.000Z","likes":3,"trendingScore":1,"private":false,"sha":"19639219617d083a03e24842fc620e010cc63189","description":"\n\t\n\t\t\n\t\tMoGUI😈 and MoCon🛡️\n\t\n\n\n\n📃 Paper | 😈 MoGUI Data| 🛡️ MoCon Data \n简体中文 | English\n\n\n\n\t\n\t\t\n\t\t🔥 News\n\t\n\n\n[Cooming Soon] We will release the complete technical report soon.\n[2024.3.1] We have released MoCon🛡️ data.\n[2024.2.29] We have released MoGUI😈 data and pre-release paper.\n\n\n\t\n\t\t\n\t\n\t\n\t\t📑 Citation\n\t\n\nIf you find our work useful, please cite us!\n@inproceedings{zhu2025moba,\n  title={MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation}… See the full description on the dataset page: https://huggingface.co/datasets/OpenDFM/MoGUI.","downloads":447,"tags":["license:cc-by-nc-sa-4.0","region:us","GUI"],"createdAt":"2024-02-26T13:19:15.000Z","key":""},{"_id":"65dcc9d95f27009cd9751ec7","id":"2A2I/Arabic-OpenHermes-2.5","author":"2A2I","disabled":false,"gated":false,"lastModified":"2024-03-15T16:10:48.000Z","likes":21,"trendingScore":1,"private":false,"sha":"e189719530400126d6a1108f9fef2f55f9a4557f","description":"\n\t\n\t\t\n\t\tDataset Card for \"Arabic-OpenHermes-2.5\"\n\t\n\n\n\n\n\t\n\t\t\n\t\tDataset Sources & Infos\n\t\n\n\nData Origin: Derived from the original OpenHermes dataset : teknium/OpenHermes-2.5.\nLanguages: Modern Standard 
Arabic (MSA)\nApplications: Language Modeling\nMaintainer: Marwa El Kamil & Mohammed Machrouh\nLicense: Apache-2.0\n\n\n\t\n\t\t\n\t\n\t\n\t\tOverview\n\t\n\nArabic-OpenHermes-2.5 is a carefully curated dataset extracted / translated from the OpenHermes-2.5 collection provided by teknium.\n\n\t\n\t\n\t\n\t\tPurpose… See the full description on the dataset page: https://huggingface.co/datasets/2A2I/Arabic-OpenHermes-2.5.","downloads":329,"tags":["language:ar","license:apache-2.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","synthetic","GPT-4","Distillation","Compilation"],"createdAt":"2024-02-26T17:26:49.000Z","key":""},{"_id":"65de2c2679177c2f82e321f8","id":"mi-rei/ClinicalTrial-gov_QA","author":"mi-rei","disabled":false,"gated":false,"lastModified":"2024-02-27T18:53:51.000Z","likes":1,"trendingScore":1,"private":false,"sha":"984bde7b128f37c295f3ad84fb003aaeb73e3535","downloads":11,"tags":["size_categories:100K<n<1M","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-02-27T18:38:30.000Z","key":""},{"_id":"65dea52c2915c5bcbbc3b733","id":"CyberNative/Code_Vulnerability_Security_DPO","author":"CyberNative","disabled":false,"gated":false,"lastModified":"2024-02-29T15:24:07.000Z","likes":162,"trendingScore":1,"private":false,"sha":"81aeacf06cf43b16d7278a3a01f019a496a53c51","description":"\n\t\n\t\t\n\t\tCybernative.ai Code Vulnerability and Security Dataset\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\nThe Cybernative.ai Code Vulnerability and Security Dataset is a dataset of synthetic Data Programming by Demonstration (DPO) pairs, focusing on the intricate relationship between secure and insecure code across a variety of programming languages. 
This dataset is meticulously crafted to serve as a pivotal resource for researchers, cybersecurity professionals, and AI developers who are keen on… See the full description on the dataset page: https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_DPO.","downloads":3760,"tags":["license:apache-2.0","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","dpo","cybersecurity","programming","code","Python"],"createdAt":"2024-02-28T03:14:52.000Z","key":""},{"_id":"65df4c0eee6c040bc8d9804b","id":"free-law/Caselaw_Access_Project","author":"free-law","disabled":false,"gated":"manual","lastModified":"2024-03-16T20:01:40.000Z","likes":95,"trendingScore":1,"private":false,"sha":"cee53e98cb2b401865a017ef291a59fbb3407179","description":"\n\n\n\t\n\t\t\n\t\tThe Caselaw Access Project\n\t\n\nIn collaboration with Ravel Law, Harvard Law Library digitized over 40 million U.S. court decisions consisting of 6.7 million cases from the last 360 years into a dataset that is widely accessible to use. 
Access a bulk download of the data through the Caselaw Access Project API (CAPAPI): https://case.law/caselaw/\nFind more information about accessing state and federal written court decisions of common law through the bulk data service documentation here:… See the full description on the dataset page: https://huggingface.co/datasets/free-law/Caselaw_Access_Project.","downloads":111,"tags":["task_categories:text-generation","language:en","license:cc0-1.0","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","legal","law","caselaw"],"createdAt":"2024-02-28T15:06:54.000Z","key":""},{"_id":"65e127b19f978976f16dbe83","id":"microsoft/orca-math-word-problems-200k","author":"microsoft","disabled":false,"gated":false,"lastModified":"2024-03-04T18:01:08.000Z","likes":486,"trendingScore":1,"private":false,"sha":"29255d1770cc4eac66e5e7fa378cba542c026350","description":"\n\t\n\t\t\n\t\tDataset Card\n\t\n\n\n\nThis dataset contains ~200K grade school math word problems. All the answers in this dataset is generated using Azure GPT4-Turbo. Please refer to Orca-Math: Unlocking the potential of\nSLMs in Grade School Math for details about the dataset construction. 
\n\n\t\n\t\t\n\t\tDataset Sources\n\t\n\n\n\n\nRepository: microsoft/orca-math-word-problems-200k\nPaper: Orca-Math: Unlocking the potential of\nSLMs in Grade School Math\n\n\n\t\n\t\t\n\t\tDirect Use\n\t\n\n\n\nThis dataset has been designed to… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k.","downloads":7659,"tags":["task_categories:question-answering","language:en","license:mit","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2402.14830","region:us","math"],"createdAt":"2024-03-01T00:56:17.000Z","key":""},{"_id":"65e18875138bd34f8ed93458","id":"OpenDFM/MoCon","author":"OpenDFM","disabled":false,"gated":false,"lastModified":"2024-03-14T13:54:45.000Z","likes":3,"trendingScore":1,"private":false,"sha":"e37127c9ad04ccd5a1381aa60054683eda8a5462","description":"\n\t\n\t\t\n\t\tMoGUI😈 and MoCon🛡️\n\t\n\n\n\n📃 Paper | 🛡️ MoCon Data | 😈 MoGUI Data\n简体中文 | English\n\n\n\n\t\n\t\t\n\t\t🔥 News\n\t\n\n\n[Cooming Soon] We will release the complete technical report soon.\n[2024.3.1] We have released MoCon🛡️ data.\n[2024.2.29] We have released MoGUI😈 data and pre-release paper.\n\n\n\t\n\t\t\n\t\n\t\n\t\t📑 Citation\n\t\n\nIf you find our work useful, please cite us!\n@misc{zhu2024mogui,\n  title={Technical Report of MoGUI and MoCon}, \n  author={Zichen Zhu and Liangtai Sun and Danyang Zhang and Ziyuan Li… See the full description on the dataset page: 
https://huggingface.co/datasets/OpenDFM/MoCon.","downloads":34,"tags":["license:cc-by-nc-sa-4.0","region:us","GUI"],"createdAt":"2024-03-01T07:49:09.000Z","key":""},{"_id":"65e1c007b772892fc23e4292","id":"JetBrains-Research/jupyter-errors-dataset","author":"JetBrains-Research","disabled":false,"gated":false,"lastModified":"2024-03-19T10:47:26.000Z","likes":3,"trendingScore":1,"private":false,"sha":"1074d7c95c258c03fb554609f127503bdcbd6d5e","description":"\n\t\n\t\t\n\t\tDataset Summary\n\t\n\n The presented dataset contains 10000 Jupyter notebooks, \n each of which contains at least one error. In addition to the notebook content, \n the dataset also provides information about the repository where the notebook is stored. \n This information can help restore the environment if needed.\n\n\t\n\t\t\n\t\tGetting Started\n\t\n\nThis dataset is organized such that it can be naively loaded via the Hugging Face datasets library. We recommend using streaming due to the large size… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/jupyter-errors-dataset.","downloads":85,"tags":["license:apache-2.0","size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","jupyter 
notebook"],"createdAt":"2024-03-01T11:46:15.000Z","key":""},{"_id":"65e1e6b1fa76e3de8a97a5dd","id":"alinet/pubmed_qa","author":"alinet","disabled":false,"gated":false,"lastModified":"2024-03-09T13:49:03.000Z","likes":1,"trendingScore":1,"private":false,"sha":"b64f09473b4e4eef20c64010581f1706dd9dd645","downloads":53,"tags":["size_categories:100K<n<1M","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-03-01T14:31:13.000Z","key":""},{"_id":"65e574a84dbf9514fb19e2ea","id":"MoE-UNC/wikihop","author":"MoE-UNC","disabled":false,"gated":false,"lastModified":"2024-03-04T07:24:17.000Z","likes":3,"trendingScore":1,"private":false,"sha":"2a59420990da1085b42b474cc38d2520c27a6325","downloads":141,"tags":["size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-03-04T07:13:44.000Z","key":""},{"_id":"65e58e2457459fa149e93b58","id":"TheFinAI/finben-fomc","author":"TheFinAI","disabled":false,"gated":false,"lastModified":"2025-02-21T20:58:54.000Z","likes":4,"trendingScore":1,"private":false,"sha":"e207a94f7ebbd1c827b119b29b4a90f9673fbd22","description":"\n\n\t\n\t\t\n\t\tDataset Card for FinBen-FOMC\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nFinBen-FOMC is a financial sentiment classification dataset adapted from FOMC (Shah et al., 2023a). 
The dataset is designed for training and evaluating large language models (LLMs) on classifying central bank policy stances as Hawkish, Dovish, or Neutral.\n\n\t\n\t\t\n\t\tSupported Tasks and Leaderboards\n\t\n\n\nTask: Hawkish-Dovish Classification\nEvaluation Metric: F1 Score, Accuracy\nTest Size: 496 instances\n\n\n\t\n\t\t\n\t\tLanguages\n\t\n\n\nEnglish… See the full description on the dataset page: https://huggingface.co/datasets/TheFinAI/finben-fomc.","downloads":133,"tags":["task_categories:text-classification","language:en","license:cc-by-nc-4.0","size_categories:n<1K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2402.12659","region:us","finance"],"createdAt":"2024-03-04T09:02:28.000Z","key":""},{"_id":"65e718949ad994f8e02a5401","id":"chenxz/RareBench","author":"chenxz","disabled":false,"gated":false,"lastModified":"2024-12-12T11:28:13.000Z","likes":12,"trendingScore":1,"private":false,"sha":"6f054e04071953ef2c1779b279074245f2ab398c","description":"RareBench is a pioneering benchmark designed to systematically evaluate the capabilities of LLMs within the realm of rare diseases.","downloads":749,"tags":["task_categories:question-answering","language:en","license:apache-2.0","size_categories:1K<n<10K","arxiv:2402.06341","region:us","medical"],"createdAt":"2024-03-05T13:05:24.000Z","key":""},{"_id":"65e88c40bffe3980c632b974","id":"MarkrAI/KoCommercial-Dataset","author":"MarkrAI","disabled":false,"gated":false,"lastModified":"2024-03-22T09:22:58.000Z","likes":166,"trendingScore":1,"private":false,"sha":"c7ae950f2984ab3ff6a2365331bc9fa27cdac185","description":"\n\t\n\t\t\n\t\tSSL 데이터 생성을 위한 코드 공개\n\t\n\nSSL 데이터 생성용 Github Repo\n\nNIA와  AI-Hub와의 저작권 협의 하에, 조금 혼선이 생긴것 죄송합니다.\n\n이에 기존에 저희가 code베이스로 SSL 데이터를 생성했던 코드를 그대로 공개드립니다.\n\n다만, 이 과정에서는 저희 이후 파이프라인인, 자체 로컬 모델을 가지고 필터링하거나 수정하는 과정이 없어, 어느정도 감안을 해주시면 감사하겠습니다.\n\n코드는 누구나 사용하실 수 있고 과제와 Task에 맞게 활용하시면 
감사하겠습니다!\n\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset: KoCommercial-Dataset\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tInfo\n\t\n\nDataset 개수: 약 1.44M\nLicense: MIT \nDataset list(전부 상업적 용도로 이용가능)  \n\nkyujinpy/KOpen-platypus (*Except non-commercial datasets)… See the full description on the dataset page: https://huggingface.co/datasets/MarkrAI/KoCommercial-Dataset.","downloads":329,"tags":["language:ko","license:mit","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2305.14045","arxiv:2309.09530","arxiv:2107.06499","region:us"],"createdAt":"2024-03-06T15:31:12.000Z","key":""},{"_id":"65ebc6fb6c9ec24a14516cc8","id":"japanese-asr/ja_asr.jsut_basic5000","author":"japanese-asr","disabled":false,"gated":false,"lastModified":"2024-04-14T14:12:37.000Z","likes":11,"trendingScore":1,"private":false,"sha":"278db379fc96167ff2293d7abf9ab86976afcd78","downloads":459,"tags":["size_categories:1K<n<10K","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-03-09T02:18:35.000Z","key":""},{"_id":"65ee0700a24bc019111e8fe5","id":"mcemilg/GECTurk-generation","author":"mcemilg","disabled":false,"gated":false,"lastModified":"2024-03-10T19:18:17.000Z","likes":2,"trendingScore":1,"private":false,"sha":"36f6a61aca96cafc4149fa823510a3fa81b98ee3","description":"Homepage: https://github.com/GGLAB-KU/gecturk/\n","downloads":49,"tags":["size_categories:100K<n<1M","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-03-10T19:16:16.000Z","key":""},{"_id":"65f1698ffd7e9976e30d86b2","id":"PleIAs/Italian-PD","author":"PleIAs","disabled":false,"gated":false,"lastModified":"2024-07-29T18:00:53.000Z","likes":11,"trendingScore":1,"private":false,"sha":"4dc027b568a20d8cb25a4d78ee9653fd5ff01d66","description":"\n\t\n\t\t\n\t\t🇮🇹 Italian Public Domain 
Books (Italian) 🇮🇹\n\t\n\nItalian-Public Domain-Book or Italian-PD-Books is a large collection aiming to aggregate all Italian monographies in the public domain. As of March 2024, it is the biggest Italian open corpus. \n\n\t\n\t\t\n\t\tDataset summary\n\t\n\nThe collection contains 12,945,781,983 words (171,113 titles) recovered from multiple sources, including Internet Archive and various European national libraries and cultural heritage institutions. Each parquet file… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/Italian-PD.","downloads":314,"tags":["region:us"],"createdAt":"2024-03-13T08:53:35.000Z","key":""},{"_id":"65f217a1651e3e7d085b1200","id":"bastao/VeraCruz_PT-BR","author":"bastao","disabled":false,"gated":false,"lastModified":"2025-07-21T11:53:53.000Z","likes":17,"trendingScore":1,"private":false,"sha":"82bf943cc8ea2163f2d65458c20044d24f42f43c","description":"\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThe VeraCruz Dataset is a comprehensive collection of Portuguese language content, showcasing the linguistic and cultural diversity of Portuguese-speaking regions. It includes around 190 million samples, organized by regional origin as indicated by URL metadata into primary categories. 
The primary categories are:\n\nPortugal (PT): Samples with content URLs indicating a clear Portuguese origin.\nBrazil (BR): Samples with content URLs indicating a clear Brazilian… See the full description on the dataset page: https://huggingface.co/datasets/bastao/VeraCruz_PT-BR.","downloads":32506,"tags":["task_categories:text-generation","task_categories:text-classification","language:pt","size_categories:100M<n<1B","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","pt","br","portuguese","brazilian","portugal","brazil"],"createdAt":"2024-03-13T21:16:17.000Z","key":""},{"_id":"65f2a4254ab77537428db3ef","id":"hazyresearch/based-swde","author":"hazyresearch","disabled":false,"gated":false,"lastModified":"2024-05-19T06:50:56.000Z","likes":5,"trendingScore":1,"private":false,"sha":"27f0223028a5d1ce174011051d3d22a5fafda830","description":"This dataset is adapted from the paper Language Models Enable Simple Systems for Generating\nStructured Views of Heterogeneous Data Lakes. 
You can learn more about the data collection process there.\nPlease consider citing the following if you use this task in your work:\n@article{arora2024simple,\n  title={Simple linear attention language models balance the recall-throughput tradeoff},\n  author={Arora, Simran and Eyuboglu, Sabri and Zhang, Michael and Timalsina, Aman and Alberti, Silas and… See the full description on the dataset page: https://huggingface.co/datasets/hazyresearch/based-swde.","downloads":2688,"tags":["task_categories:question-answering","task_categories:feature-extraction","size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2402.18668","region:us"],"createdAt":"2024-03-14T07:15:49.000Z","key":""},{"_id":"65f5ee5b9a548898ea96f1b6","id":"Romit2004/LinuxCommands","author":"Romit2004","disabled":false,"gated":false,"lastModified":"2024-03-22T13:37:22.000Z","likes":27,"trendingScore":1,"private":false,"sha":"2941f9386e1a8638ae0c1ce3d857862696b584d8","downloads":111,"tags":["license:mit","size_categories:10K<n<100K","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-03-16T19:09:15.000Z","key":""},{"_id":"65f8520224b3b36bf10680bd","id":"jamessyx/PathCap","author":"jamessyx","disabled":false,"gated":"auto","lastModified":"2024-03-19T15:43:58.000Z","likes":13,"trendingScore":1,"private":false,"sha":"17205b19b89ab4ad812a81e9b166d0bcbdaa928e","description":"This is the official Hugging Face repo for PathCap dataset.\n\n\t\n\t\t\n\t\tCitation\n\t\n\n@article{sun2023pathasst,\n  title={Pathasst: Redefining pathology through generative foundation ai assistant for pathology},\n  author={Sun, Yuxuan and Zhu, Chenglu and Zheng, Sunyi and Zhang, Kai and Shui, Zhongyi and Yu, Xiaoxuan and Zhao, Yizhi and Li, Honglin and Zhang, Yunlong and Zhao, Ruojia and others},\n  journal={arXiv preprint arXiv:2305.15072},\n  
year={2023}\n}\n\n","downloads":33,"tags":["license:cc-by-nc-2.0","modality:image","arxiv:2305.15072","region:us"],"createdAt":"2024-03-18T14:38:58.000Z","key":""},{"_id":"65f94b8aea2d72e3b59661cf","id":"argilla/ultrafeedback-binarized-preferences-cleaned-kto","author":"argilla","disabled":false,"gated":false,"lastModified":"2024-03-19T12:17:36.000Z","likes":10,"trendingScore":1,"private":false,"sha":"809993879c4c389f965bb9e2b37968c3aceffe4f","description":"\n\t\n\t\t\n\t\tUltraFeedback - Binarized using the Average of Preference Ratings (Cleaned) KTO\n\t\n\n\nA KTO signal transformed version of the highly loved UltraFeedback Binarized Preferences Cleaned, the preferred dataset by Argilla to use from now on when fine-tuning on UltraFeedback\n\nThis dataset represents a new iteration on top of argilla/ultrafeedback-binarized-preferences,\nand is the recommended and preferred dataset by Argilla to use from now on when fine-tuning on UltraFeedback.\nRead more about… See the full description on the dataset page: 
https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned-kto.","downloads":6850,"tags":["task_categories:text-generation","language:en","license:mit","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2402.01306","region:us","kto","preference","ultrafeedback"],"createdAt":"2024-03-19T08:23:38.000Z","key":""},{"_id":"660258b5fad94d04e71c582c","id":"mPLUG/DocReason25K","author":"mPLUG","disabled":false,"gated":false,"lastModified":"2024-03-26T05:18:53.000Z","likes":8,"trendingScore":1,"private":false,"sha":"1e166c1d05f5b45a8791e72c42e964b117b61c3e","downloads":64,"tags":["license:apache-2.0","size_categories:n<1K","format:webdataset","modality:image","modality:text","library:datasets","library:webdataset","library:mlcroissant","region:us"],"createdAt":"2024-03-26T05:10:13.000Z","key":""},{"_id":"66027117e69a30d7570fb914","id":"isek-ai/danbooru-tags-2024","author":"isek-ai","disabled":false,"gated":false,"lastModified":"2025-03-03T07:04:08.000Z","likes":25,"trendingScore":1,"private":false,"sha":"b1ee778f07b50c1ef0da4e87637e5b1168708a7a","description":"\n\t\n\t\t\n\t\tdanbooru-tags-2024\n\t\n\nfrom datasets import load_dataset\n\nds = load_dataset(\n  \"isek-ai/danbooru-tags-2024\",\n# revision=\"202412-at20250122\", # optional\n  split=\"train\",\n)\n\nLast updated: since 2005 to 2024/12/31, collected at 
2025/01/22\n","downloads":168,"tags":["task_categories:text-generation","task_categories:text-classification","license:cc0-1.0","size_categories:1M<n<10M","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","danbooru"],"createdAt":"2024-03-26T06:54:15.000Z","key":""},{"_id":"6602cb81ee4e5363caf5269b","id":"nllg/datikz","author":"nllg","disabled":false,"gated":false,"lastModified":"2024-05-21T18:51:26.000Z","likes":7,"trendingScore":1,"private":false,"sha":"2d03e7098ab71186a8a7f96c0d115dcda320243f","description":"\n\t\n\t\t\n\t\tDataset Card for DaTikZ\n\t\n\nDaTikZ is a dataset of TikZ drawings aligned with captions. In compliance with licensing agreements, certain TikZ drawings are excluded from this public version of the dataset. Check out the AutomaTikZ project and the DaTikZ repository for more information as well as tools and methods to recreate the complete dataset from scratch.\n\n\t\n\t\t\n\t\n\t\n\t\tUsage\n\t\n\nfrom  datasets  import  load_dataset\n# full dataset\nds = load_dataset(\"nllg/datikz\")\n# only the train split… See the full description on the dataset page: https://huggingface.co/datasets/nllg/datikz.","downloads":889,"tags":["size_categories:10K<n<100K","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-03-26T13:20:01.000Z","key":""},{"_id":"660437d87aa4d442c4f1cb48","id":"mohanrj/MEN-Malaysian_English_News_Article_Dataset","author":"mohanrj","disabled":false,"gated":false,"lastModified":"2024-03-27T15:19:25.000Z","likes":1,"trendingScore":1,"private":false,"sha":"01f15300155a475c76eea77d82f2220c24d4aedf","description":"Standard English and Malaysian English exhibit notable differences, posing challenges for natural language processing (NLP) tasks on Malaysian English. 
Unfortunately, most of the existing datasets are mainly based on standard English and therefore inadequate for improving NLP tasks in Malaysian English. An experiment using state-of-the-art Named Entity Recognition (NER) solutions on Malaysian English news articles highlights that they cannot handle morphosyntactic variations in Malaysian… See the full description on the dataset page: https://huggingface.co/datasets/mohanrj/MEN-Malaysian_English_News_Article_Dataset.","downloads":20,"tags":["arxiv:2402.14521","region:us"],"createdAt":"2024-03-27T15:14:32.000Z","key":""},{"_id":"66052a7e511831c2ded7c941","id":"criteo/criteo-attribution-dataset","author":"criteo","disabled":false,"gated":false,"lastModified":"2024-03-28T08:53:42.000Z","likes":8,"trendingScore":1,"private":false,"sha":"904188a63cbad78bee43cd26ff5ee4ac77903986","description":"\n\t\n\t\t\n\t\tCriteo Attribution Modeling for Bidding Dataset\n\t\n\nThis dataset is released along with the paper:\nAttribution Modeling Increases Efficiency of Bidding in Display Advertising\nEustache Diemert*, Julien Meynet* (Criteo Research), Damien Lefortier (Facebook), Pierre Galland (Criteo) *authors contributed equally\n2017 AdKDD & TargetAd Workshop, in conjunction with The 23rd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2017) \nWhen using this dataset, please cite the paper… See the full description on the dataset page: 
https://huggingface.co/datasets/criteo/criteo-attribution-dataset.","downloads":650,"tags":["task_categories:tabular-classification","license:cc-by-nc-sa-4.0","size_categories:10M<n<100M","format:csv","modality:tabular","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:1707.06409","region:us","criteo","advertising"],"createdAt":"2024-03-28T08:29:50.000Z","key":""},{"_id":"6605475abe5f99d0cb824a15","id":"phonemetransformers/IPA-CHILDES","author":"phonemetransformers","disabled":false,"gated":false,"lastModified":"2025-04-08T14:22:09.000Z","likes":7,"trendingScore":1,"private":false,"sha":"2f4e63f61b3e4b11b470a511d6abcd18d1e3ad9e","description":"\n\t\n\t\t\n\t\tIPA-CHILDES Dataset\n\t\n\nThis dataset contains utterances downloaded from CHILDES which have been pre-processed and converted to a phonemic representation. Read the paper here. \n\n\t\n\t\t\n\t\tDescription\n\t\n\n\n\t\n\t\t\n\t\tKey Columns\n\t\n\nThe scripts used to create the dataset are available here. Many of the columns from CHILDES have been preserved as they are useful for experiments (e.g. number of morphemes, part-of-speech tags, etc.). 
The key columns added by the processing script are as follows:… See the full description on the dataset page: https://huggingface.co/datasets/phonemetransformers/IPA-CHILDES.","downloads":209,"tags":["language:en","language:de","language:fr","language:es","language:nl","language:cmn","language:ja","language:yue","language:et","language:hr","language:da","language:eu","language:hu","language:tr","language:fa","language:is","language:id","language:ga","language:cy","language:ko","language:sv","language:nb","language:qu","language:ca","language:it","language:pt","language:ro","language:pl","size_categories:10M<n<100M","format:csv","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2504.03036","arxiv:2504.03338","region:us","language modeling","cognitive modeling"],"createdAt":"2024-03-28T10:32:58.000Z","key":""},{"_id":"6605577e3fd3fdf5a4f68976","id":"KisanVaani/agriculture-qa-english-only","author":"KisanVaani","disabled":false,"gated":false,"lastModified":"2024-03-28T12:39:42.000Z","likes":23,"trendingScore":1,"private":false,"sha":"2f32fb399fbb71eb417235aeb349e637f785a0fd","description":"\n\t\n\t\t\n\t\tDataset Card for Dataset Name\n\t\n\n\n\nThis dataset contains question-answer pairs related to agriculture. The dataset can be used for tasks such as question answering, information retrieval, and natural language understanding in the agricultural domain. 
The questions cover various aspects of agriculture, including crop production, animal husbandry, soil management, and farming practices.\n\n\t\n\t\t\n\t\tDataset Details\n\t\n\nThe dataset is structured as a collection of JSON files, with each file… See the full description on the dataset page: https://huggingface.co/datasets/KisanVaani/agriculture-qa-english-only.","downloads":275,"tags":["task_categories:question-answering","license:apache-2.0","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","Agriculture","agriculture_qa"],"createdAt":"2024-03-28T11:41:50.000Z","key":""},{"_id":"660a16c89e582313a14ff351","id":"xcodemind/webcode2m","author":"xcodemind","disabled":false,"gated":false,"lastModified":"2025-03-05T08:41:57.000Z","likes":46,"trendingScore":1,"private":false,"sha":"f53cd4a9317364e32a6fc7d99dc50761fe54715f","description":"WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs with Layouts\n(This dataset is also called Vision2UI.)\n\nAutomatically generating webpage code from webpage designs can significantly reduce the workload of front-end developers, and recent Multimodal Large Language Models (MLLMs) have shown promising potential in this area. 
However, our investigation reveals that most existing MLLMs are constrained by the absence of high-quality, large-scale, real-world datasets, resulting in… See the full description on the dataset page: https://huggingface.co/datasets/xcodemind/webcode2m.","downloads":3908,"tags":["task_categories:image-to-text","license:cc-by-4.0","size_categories:1M<n<10M","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2404.06369","region:us","code"],"createdAt":"2024-04-01T02:07:04.000Z","key":""},{"_id":"660add462617cc957fd1bc02","id":"stanford-cs336/owt-sample","author":"stanford-cs336","disabled":false,"gated":false,"lastModified":"2024-04-01T19:52:51.000Z","likes":5,"trendingScore":1,"private":false,"sha":"488afadefc224407de5645a3c8c53b64ed923537","description":"These files were created with the following script:\nfrom datasets import load_dataset\nfrom tqdm import tqdm\nimport io\n\ndataset = load_dataset(\"Skylion007/openwebtext\")['train']\nsplit_dataset = dataset.train_test_split(train_size=2400000, test_size=60000, seed=0)\n\n\nwith io.open('data/owt_train.txt','w') as fopen:\n    listout = []\n    for data in tqdm(split_dataset['train']):\n        listout.append(data['text']+'<|endoftext|>')\n        if len(listout) > 1000:\n            _ =… See the full description on the dataset page: https://huggingface.co/datasets/stanford-cs336/owt-sample.","downloads":2286,"tags":["size_categories:10M<n<100M","format:text","modality:text","library:datasets","library:mlcroissant","region:us"],"createdAt":"2024-04-01T16:13:58.000Z","key":""},{"_id":"660c0b3addae775e2936c04a","id":"AdaptLLM/ConvFinQA","author":"AdaptLLM","disabled":false,"gated":false,"lastModified":"2024-11-30T08:26:09.000Z","likes":8,"trendingScore":1,"private":false,"sha":"0a323d962e70db017aca29be754c7bcbd2b66711","description":"\n\t\n\t\t\n\t\tAdapting Large Language Models to Domains via Continual Pre-Training\n\t\n\nThis repo 
contains the ConvFinQA dataset used in our ICLR 2024 paper Adapting Large Language Models via Reading Comprehension.\nWe explore continued pre-training on domain-specific corpora for large language models. While this approach enriches LLMs with domain knowledge, it significantly hurts their prompting ability for question answering. Inspired by human learning via reading comprehension, we propose a simple… See the full description on the dataset page: https://huggingface.co/datasets/AdaptLLM/ConvFinQA.","downloads":318,"tags":["task_categories:text-classification","task_categories:question-answering","task_categories:zero-shot-classification","language:en","arxiv:2309.09530","arxiv:2406.14491","region:us","finance"],"createdAt":"2024-04-02T13:42:18.000Z","key":""},{"_id":"660e422fcd9f60734864a9b5","id":"capleaf/viVoice","author":"capleaf","disabled":false,"gated":"auto","lastModified":"2024-07-01T07:00:51.000Z","likes":90,"trendingScore":1,"private":false,"sha":"693e9882c8e73b31e96a723741fa1861c3252b75","description":"\n\t\n\t\t\n\t\tImportant Note ⚠️\n\t\n\nThis dataset is only to be used for research purposes. Access requests must be made via your school, institution, or work email. Requests from common email services will be rejected. We apologize for any inconvenience. 
\n\n\t\n\t\t\n\t\tviVoice: Enabling Vietnamese Multi-Speaker Speech Synthesis\n\t\n\nFor a comprehensive description, please visit https://github.com/thinhlpg/viVoice\nThis dataset is licensed under CC-BY-NC-SA-4.0 and is intended for research purposes only.… See the full description on the dataset page: https://huggingface.co/datasets/capleaf/viVoice.","downloads":2154,"tags":["task_categories:text-to-speech","language:vi","license:cc-by-nc-sa-4.0","size_categories:100K<n<1M","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-04-04T06:01:19.000Z","key":""},{"_id":"660ebdfa0b8626bbe877cde1","id":"Manishsahu53/Solar-Panel-Thermal-Drone-UAV-Images","author":"Manishsahu53","disabled":false,"gated":false,"lastModified":"2025-01-15T13:47:21.000Z","likes":9,"trendingScore":1,"private":false,"sha":"48dffd93441149642b02d36f02474c013b913aee","description":"Git\nThis is Thermal Images of solar power plant captured using DJI drone in India.\nHow to read thermal values from Image:\n\nhttps://github.com/ManishSahu53/read_thermal_temperature\n\nHow to do automated hotspot detection:\n\nhttps://github.com/ManishSahu53/solarHotspotAnalysis\n\n","downloads":212,"tags":["task_categories:zero-shot-object-detection","language:en","language:hi","license:apache-2.0","modality:image","region:us","Thermal Images","Solar Panael","Drone","UAV"],"createdAt":"2024-04-04T14:49:30.000Z","key":""},{"_id":"66136f846b0b4dd0f0d87fba","id":"hitachi-nlp/ruletaker","author":"hitachi-nlp","disabled":false,"gated":false,"lastModified":"2024-04-08T04:21:33.000Z","likes":2,"trendingScore":1,"private":false,"sha":"24998c7e309709eaeb5d1ae1bdc40b06e43f5438","description":"\n\t\n\t\t\n\t\tDataset Card for \"ruletaker\"\n\t\n\nMore Information 
needed\n","downloads":229,"tags":["size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-04-08T04:16:04.000Z","key":""},{"_id":"661386481f4995fba8857e77","id":"McAuley-Lab/Amazon-C4","author":"McAuley-Lab","disabled":false,"gated":false,"lastModified":"2024-04-09T04:13:44.000Z","likes":8,"trendingScore":1,"private":false,"sha":"39322697749a88d179f88d322a2fe4765b655c98","description":"\n\t\n\t\t\n\t\tAmazon-C4\n\t\n\nA complex product search dataset built based on Amazon Reviews 2023 dataset.\nC4 is short for Complex Contexts Created by ChatGPT.\n\n\t\n\t\t\n\t\tQuick Start\n\t\n\n\n\t\n\t\t\n\t\tLoading Queries\n\t\n\nfrom datasets import load_dataset\ndataset = load_dataset('McAuley-Lab/Amazon-C4')['test']\n\n>>> dataset\nDataset({\n    features: ['qid', 'query', 'item_id', 'user_id', 'ori_rating', 'ori_review'],\n    num_rows: 21223\n})\n\n>>> dataset[288]\n{'qid': 288, 'query': 'I need something that can entertain my… See the full description on the dataset page: https://huggingface.co/datasets/McAuley-Lab/Amazon-C4.","downloads":228,"tags":["language:en","size_categories:10K<n<100K","format:csv","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2403.03952","region:us","instruction-following","recommendation","product 
search"],"createdAt":"2024-04-08T05:53:12.000Z","key":""},{"_id":"66150427cf3fef4fa8656274","id":"LooksJuicy/ruozhiba","author":"LooksJuicy","disabled":false,"gated":false,"lastModified":"2024-04-09T09:10:55.000Z","likes":319,"trendingScore":1,"private":false,"sha":"2a39d86721e0109a7c598a25a1338e297c639d2f","description":"受COIG-CQIA启发，构建类似数据集，但答案风格相对更简洁。\n弱智吧精选问题数据来自github提供的疑问句，调用GPT-4获取答案，并过滤掉明显拒答的回复。\n","downloads":312,"tags":["task_categories:text-generation","language:zh","license:apache-2.0","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-04-09T09:02:31.000Z","key":""},{"_id":"661680d2b3d0b21da598afe6","id":"Wenetspeech4TTS/WenetSpeech4TTS","author":"Wenetspeech4TTS","disabled":false,"gated":"auto","lastModified":"2024-07-25T11:56:49.000Z","likes":87,"trendingScore":1,"private":false,"sha":"5e3ea1bdeb573401ef876cfb6cdd31d198c104fc","citation":"\\","description":"WenetSpeech4TTS is a multi-domain Mandarin corpus derived from the open-sourced WenetSpeech dataset. \nTailored for the text-to-speech tasks, we refined WenetSpeech by adjusting segment boundaries, enhancing the audio quality, and eliminating speaker mixing within each segment. \nFollowing a more accurate transcription process and quality-based data filtering process, the obtained WenetSpeech4TTS corpus contains 12,800 hours of paired audio-text data. 
\nFurthermore, we have created subsets of varying sizes, categorized by segment quality scores to allow for TTS model training and finetuning.","downloads":2138,"tags":["task_categories:text-to-speech","multilinguality:monolingual","language:zh","license:cc-by-4.0","size_categories:10M<n<100M","arxiv:2110.03370","arxiv:2301.02111","arxiv:2304.09116","arxiv:2406.05763","region:us"],"createdAt":"2024-04-10T12:06:42.000Z","key":""},{"_id":"6616ab2d8068f800ac6e8f62","id":"jamessyx/PathInstruct","author":"jamessyx","disabled":false,"gated":"auto","lastModified":"2024-04-13T17:08:01.000Z","likes":13,"trendingScore":1,"private":false,"sha":"8758dc0630a46c8df488b107eaab3eb95890860b","description":"This is the official Hugging Face repo for PathInstruct dataset.\n\n\t\n\t\t\n\t\tCitation\n\t\n\n@article{sun2023pathasst,\n  title={Pathasst: Redefining pathology through generative foundation ai assistant for pathology},\n  author={Sun, Yuxuan and Zhu, Chenglu and Zheng, Sunyi and Zhang, Kai and Shui, Zhongyi and Yu, Xiaoxuan and Zhao, Yizhi and Li, Honglin and Zhang, Yunlong and Zhao, Ruojia and others},\n  journal={arXiv preprint arXiv:2305.15072},\n  year={2023}\n}\n\n","downloads":2,"tags":["license:cc-by-nc-2.0","arxiv:2305.15072","region:us"],"createdAt":"2024-04-10T15:07:25.000Z","key":""},{"_id":"661823b590a8b6724f1c6534","id":"HuggingFaceM4/the_cauldron","author":"HuggingFaceM4","disabled":false,"gated":false,"lastModified":"2024-05-06T13:37:52.000Z","likes":547,"trendingScore":1,"private":false,"sha":"847a98a779b1652d65111daf20c972dfcd333605","description":"\n\t\n\t\t\n\t\tDataset Card for The Cauldron\n\t\n\n\n\n\t\n\t\t\n\t\tDataset description\n\t\n\nThe Cauldron is part of the Idefics2 release.\nIt is a massive collection of 50 vision-language datasets (training sets only) that were used for the fine-tuning of the vision-language model Idefics2.\n\n\t\n\t\t\n\t\tLoad the dataset\n\t\n\nTo load the dataset, install the library datasets with pip install datasets. 
Then,\nfrom datasets import load_dataset\nds = load_dataset(\"HuggingFaceM4/the_cauldron\", \"ai2d\")\n\nto download and load the… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceM4/the_cauldron.","downloads":280942,"tags":["size_categories:1M<n<10M","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:1603.07396","arxiv:2206.01718","arxiv:2208.05358","arxiv:1612.06890","arxiv:2310.00367","arxiv:1710.07300","arxiv:2312.12241","arxiv:1912.03098","arxiv:2211.08545","arxiv:2306.05425","arxiv:1709.00103","arxiv:2003.12462","arxiv:1612.00837","arxiv:2205.00363","arxiv:2403.09029","arxiv:2405.02246","region:us"],"createdAt":"2024-04-11T17:53:57.000Z","key":""},{"_id":"661da7499e21f5e867f74edb","id":"codesagar/malicious-llm-prompts","author":"codesagar","disabled":false,"gated":false,"lastModified":"2024-04-15T23:09:23.000Z","likes":8,"trendingScore":1,"private":false,"sha":"4f1b5a1fdf996cb41ceeee5b142cb0fb2b92251d","description":"\n\t\n\t\t\n\t\tDataset Card for \"malicious-llm-prompts\"\n\t\n\nMore Information 
needed\n","downloads":242,"tags":["size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-04-15T22:16:41.000Z","key":""},{"_id":"661f164cad7da5317914372b","id":"yasamanne/math-fine-tune","author":"yasamanne","disabled":false,"gated":false,"lastModified":"2024-04-17T00:46:13.000Z","likes":2,"trendingScore":1,"private":false,"sha":"9a73a0da428eeebb9ffe1e57c5235412d4cf37f7","downloads":10,"tags":["size_categories:100K<n<1M","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-04-17T00:22:36.000Z","key":""},{"_id":"661fc83b1666119bd4c16a01","id":"argilla/Capybara-Preferences","author":"argilla","disabled":false,"gated":false,"lastModified":"2024-05-09T08:44:55.000Z","likes":47,"trendingScore":1,"private":false,"sha":"b049c23c217369b39f179a617b47dc9ad4a46df9","description":"\n  \n    \n  \n\n\n\n\t\n\t\t\n\t\tDataset Card for Capybara-Preferences\n\t\n\nThis dataset has been created with distilabel.\n\n    \n\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThis dataset is built on top of LDJnr/Capybara, in order to generate a preference\ndataset out of an instruction-following dataset. 
This is done by keeping the conversations in the column conversation but splitting\nthe last assistant turn from it, so that the conversation contains all the turns up until the last user's turn, so that it can be reused… See the full description on the dataset page: https://huggingface.co/datasets/argilla/Capybara-Preferences.","downloads":158,"tags":["task_categories:text-generation","language:en","license:apache-2.0","size_categories:10K<n<100K","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","library:distilabel","region:us","preferences","distilabel","synthetic","dpo","orpo"],"createdAt":"2024-04-17T13:01:47.000Z","key":""},{"_id":"662005a74360f44332b11379","id":"mlabonne/orpo-dpo-mix-40k","author":"mlabonne","disabled":false,"gated":false,"lastModified":"2024-10-17T21:44:52.000Z","likes":303,"trendingScore":1,"private":false,"sha":"0f72511202b8f093e9be60e1683d84b046062e36","description":"\n\t\n\t\t\n\t\tORPO-DPO-mix-40k v1.2\n\t\n\n\nThis dataset is designed for ORPO or DPO training.\nSee Fine-tune Llama 3 with ORPO for more information about how to use it.\nIt is a combination of the following high-quality DPO datasets:\n\nargilla/Capybara-Preferences: highly scored chosen answers >=5 (7,424 samples)argilla/distilabel-intel-orca-dpo-pairs: highly scored chosen answers >=9, not in GSM8K (2,299 samples)\nargilla/ultrafeedback-binarized-preferences-cleaned: highly scored chosen answers >=5 (22… See the full description on the dataset page: 
https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k.","downloads":1017,"tags":["task_categories:text-generation","language:en","license:apache-2.0","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","dpo","rlhf","preference","orpo"],"createdAt":"2024-04-17T17:23:51.000Z","key":""},{"_id":"66221a7f11c923d051114326","id":"Lyte/2A2I-Arabic-OpenHermes-2.5-Llama-3","author":"Lyte","disabled":false,"gated":false,"lastModified":"2024-04-19T07:22:53.000Z","likes":2,"trendingScore":1,"private":false,"sha":"7e685a2b69142ab629c0e028f14f63d6126c9eb1","description":"\n\t\n\t\t\n\t\tDataset Card for \"2A2I-Arabic-OpenHermes-2.5-Llama-3\"\n\t\n\n\n\t\n\t\t\n\t\tDataset Sources & Infos\n\t\n\n\nData Origin: Derived from the original Arabic OpenHermes dataset : 2A2I/Arabic-OpenHermes-2.5.\nLanguages: Modern Standard Arabic (MSA)\nApplications: Language Modeling\nLicense: Apache-2.0\n\n\n\t\n\t\t\n\t\tOverview\n\t\n\n2A2I-Arabic-OpenHermes-2.5-Llama is a Llama-3 compatible dataset carefully converted from the 2A2I's Arabic-OpenHermes-2.5 collection provided by Lyte.\n\n\t\n\t\t\n\t\tPurpose… See the full description on the dataset page: https://huggingface.co/datasets/Lyte/2A2I-Arabic-OpenHermes-2.5-Llama-3.","downloads":66,"tags":["task_categories:question-answering","language:ar","license:apache-2.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-04-19T07:17:19.000Z","key":""},{"_id":"66252d166046dd4b85e95c9c","id":"naklecha/minecraft-question-answer-700k","author":"naklecha","disabled":false,"gated":false,"lastModified":"2024-04-22T11:19:01.000Z","likes":45,"trendingScore":1,"private":false,"sha":"8c2a6a253cbfcbbf93d9f14e535736557cdf3762","description":"\n\t\n\t\t\n\t\tminecraft-question-answer-700k\n\t\n\nIntroducing the largest synthetic Minecraft Q&A dataset, 
covering every topic, game mechanic, item and craft in Minecraft. The dataset was generated by extracting over 18,000 Minecraft wiki pages, and using glaive.ai's synthetic data generation pipeline.\n\n\t\n\t\t\n\t\tabout the dataset\n\t\n\n\nrows - 694,814\ntokens - 47,133,624\nsource - https://minecraft.wiki/\n\nHit me up on twitter if you see a bug or need a synthetic dataset for your company:… See the full description on the dataset page: https://huggingface.co/datasets/naklecha/minecraft-question-answer-700k.","downloads":101,"tags":["task_categories:question-answering","task_categories:text-generation","language:en","license:cc-by-nc-sa-3.0","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","llama3","minecraft","gaming","text generation","llm","q&a"],"createdAt":"2024-04-21T15:13:26.000Z","key":""},{"_id":"6626a7678f7573e6a64b0f3c","id":"common-canvas/commoncatalog-cc-by","author":"common-canvas","disabled":false,"gated":false,"lastModified":"2024-05-16T19:01:29.000Z","likes":39,"trendingScore":1,"private":false,"sha":"80f50fe4a1ca937f37a11be3f8eee5199d776ff3","description":"\n\t\n\t\t\n\t\tDataset Card for CommonCatalog CC-BY\n\t\n\nThis dataset is a large collection of high-resolution Creative Common images (composed of different licenses, see paper Table 1 in the Appendix) collected in 2014 from users of Yahoo Flickr. 
\nThe dataset contains images of up to 4k resolution, making this one of the highest resolution captioned image datasets.\n\n\t\n\t\t\n\t\tDataset Details\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\nWe provide synthetic captions to approximately 100 million high… See the full description on the dataset page: https://huggingface.co/datasets/common-canvas/commoncatalog-cc-by.","downloads":19190,"tags":["task_categories:text-to-image","language:en","license:cc-by-4.0","size_categories:10M<n<100M","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2310.16825","region:us"],"createdAt":"2024-04-22T18:07:35.000Z","key":""},{"_id":"6626a96c60995500ad5bc2cc","id":"HannahRoseKirk/prism-alignment","author":"HannahRoseKirk","disabled":false,"gated":false,"lastModified":"2024-04-25T09:11:35.000Z","likes":103,"trendingScore":1,"private":false,"sha":"18ab5cfb37456f4ec8cbc00212ce54cf7b1239f6","description":"\n\t\n\t\t\n\t\tDataset Card for PRISM\n\t\n\nPRISM is a diverse human feedback dataset for preference and value alignment in Large Language Models (LLMs).\nIt maps the characteristics and stated preferences of humans from a detailed survey onto their real-time interactions with LLMs and contextual preference ratings\n\n\t\n\t\t\n\t\tDataset Details\n\t\n\nThere are two sequential stages: first, participants complete a Survey where they answer questions about their demographics and stated preferences, then proceed to… See the full description on the dataset page: 
https://huggingface.co/datasets/HannahRoseKirk/prism-alignment.","downloads":865,"tags":["language:en","license:cc","size_categories:10K<n<100K","format:json","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2404.16019","doi:10.57967/hf/2113","region:us","alignment","human-feedback","ratings","preferences","ai-safety","llm","survey","fine-grained"],"createdAt":"2024-04-22T18:16:12.000Z","key":""},{"_id":"66289495d39d4ec43fe198c6","id":"Lyntas/mining_domain_specific_terminology","author":"Lyntas","disabled":false,"gated":false,"lastModified":"2024-05-03T02:17:00.000Z","likes":1,"trendingScore":1,"private":false,"sha":"837b0dbbedcef8a82f95401027278b06a9c8dd1f","downloads":24,"tags":["size_categories:n<1K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-04-24T05:11:49.000Z","key":""},{"_id":"6628fe9ec672e80208f068bb","id":"dpdl-benchmark/omniglot","author":"dpdl-benchmark","disabled":false,"gated":false,"lastModified":"2024-04-26T09:42:05.000Z","likes":1,"trendingScore":1,"private":false,"sha":"cc0366f1765d565d3e070b01f94e5d17cd234a0a","downloads":252,"tags":["size_categories:10K<n<100K","format:parquet","modality:image","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-04-24T12:44:14.000Z","key":""},{"_id":"662bbf5f8082c634bb84f29e","id":"masakhane/afrimmlu","author":"masakhane","disabled":false,"gated":false,"lastModified":"2025-04-15T03:05:01.000Z","likes":12,"trendingScore":1,"private":false,"sha":"96f247619673906ae4c321a3232afc67b98ea57e","description":"\n\t\n\t\t\n\t\tDataset Card for afrimmlu\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nAFRIMMLU is an evaluation dataset comprising translations of a subset of the MMLU dataset into 15 African languages. 
\nIt includes test sets across all 17 languages, maintaining English and French subsets from the original MMLU dataset. \n\n\t\n\t\t\n\t\tLanguages\n\t\n\nThere are 17 languages available:\n\n\t\n\t\t\n\t\tDataset Structure\n\t\n\n\n\t\n\t\t\n\t\tData Instances\n\t\n\nThe examples look like this for English:\nfrom datasets import load_dataset\ndata =… See the full description on the dataset page: https://huggingface.co/datasets/masakhane/afrimmlu.","downloads":1451,"tags":["task_categories:question-answering","task_ids:multiple-choice-qa","multilinguality:multilingual","source_datasets:mmlu","language:am","language:ee","language:ha","language:ig","language:kin","language:ln","language:lug","language:orm","language:sna","language:sot","language:tw","language:wo","language:xh","language:yo","language:zu","language:en","language:fr","language:sw","license:apache-2.0","size_categories:10K<n<100K","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","afrimmlu","afri-mmlu","africanmmlu"],"createdAt":"2024-04-26T14:51:11.000Z","key":""},{"_id":"662bf9fdbf97b6979583aee9","id":"simon3000/starrail-voice","author":"simon3000","disabled":false,"gated":false,"lastModified":"2024-08-30T04:52:04.000Z","likes":61,"trendingScore":1,"private":false,"sha":"7b6e8050a0fd1e7e052de9520f1799faa08a5751","description":"\n\t\n\t\t\n\t\tStarRail Voice\n\t\n\nStarRail Voice is a dataset of voice lines from the popular game Honkai: Star Rail.\nHugging Face 🤗  StarRail-Voice\n\nLast update at 2024-08-30\n185511 wavs\n49325 without speaker (27%)\n49409 without transcription (27%)\n41142 without inGameFilename (22%)\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Details\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Description\n\t\n\nThe dataset contains voice lines from the game's characters in multiple languages, including Chinese, English, Japanese, and Korean.\nThe voice lines are… See the full description on the dataset page: 
https://huggingface.co/datasets/simon3000/starrail-voice.","downloads":951,"tags":["task_categories:audio-classification","task_categories:automatic-speech-recognition","task_categories:text-to-speech","language:zh","language:en","language:ja","language:ko","size_categories:100K<n<1M","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-04-26T19:01:17.000Z","key":""},{"_id":"662ca2fc259fa63f7703ed6a","id":"vgdasfgadg/1","author":"vgdasfgadg","disabled":false,"gated":false,"lastModified":"2024-04-27T07:17:46.000Z","likes":3,"trendingScore":1,"private":false,"sha":"1b7d2d4e73a9ea5a22070a30434ab295f1dbc94f","downloads":96,"tags":["size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-04-27T07:02:20.000Z","key":""},{"_id":"663412cf30c0652a8ade0717","id":"Francisco-Cruz/InvoicesReceiptsPT","author":"Francisco-Cruz","disabled":false,"gated":false,"lastModified":"2024-05-02T22:50:39.000Z","likes":8,"trendingScore":1,"private":false,"sha":"a3f7a8ad4b8cfd37495293153f14e22d072327ea","description":"This is a dataset comprising 1003 images of invoices and receipts, as well as the transcription of relevant fields for each document – seller name, seller address, seller tax identification, buyer tax identification, invoice date, invoice total amount, invoice tax amount, and document reference. 
\nIt is organized as:\n\nfolder 1_Images: files with pictures of the invoices/receipts \nfolder 2_Annotations_Json: text files with the annotations in JSON format\n\nAlso available at:… See the full description on the dataset page: https://huggingface.co/datasets/Francisco-Cruz/InvoicesReceiptsPT.","downloads":1253,"tags":["task_categories:text-classification","language:pt","license:apache-2.0","size_categories:1K<n<10K","format:imagefolder","modality:image","modality:text","library:datasets","library:mlcroissant","region:us","finance"],"createdAt":"2024-05-02T22:25:19.000Z","key":""},{"_id":"66347aa61500e67c72dedeb0","id":"allenai/WildChat-1M","author":"allenai","disabled":false,"gated":false,"lastModified":"2024-10-17T18:04:41.000Z","likes":443,"trendingScore":1,"private":false,"sha":"7d6490e462285cf85d91eabea0f9a954fbddcd1f","description":"\n\t\n\t\t\n\t\tDataset Card for WildChat\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\n\nPaper: https://arxiv.org/abs/2405.01470\n\nInteractive Search Tool: https://wildvisualizer.com (paper)\n\nLicense: ODC-BY\n\nLanguage(s) (NLP): multi-lingual\n\nPoint of Contact: Yuntian Deng\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Summary\n\t\n\nWildChat is a collection of 1 million conversations between human users and ChatGPT, alongside demographic data, including state, country, hashed IP addresses, and request headers. 
We collected WildChat by… See the full description on the dataset page: https://huggingface.co/datasets/allenai/WildChat-1M.","downloads":30210,"tags":["task_categories:text-generation","task_categories:question-answering","license:odc-by","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2405.01470","arxiv:2409.03753","arxiv:2406.13706","region:us","instruction-finetuning"],"createdAt":"2024-05-03T05:48:22.000Z","key":""},{"_id":"6639d8ee06b25a7ea6f2ebf5","id":"kinokokoro/ichikara-instruction-003","author":"kinokokoro","disabled":false,"gated":false,"lastModified":"2024-05-07T07:34:16.000Z","likes":1,"trendingScore":1,"private":false,"sha":"3d424e6e198e8abca5e882ea44f74602a1d0db51","description":"CC-BY-NC-SA License\nSee: http://liat-aip.sakura.ne.jp/wp/llm%e3%81%ae%e3%81%9f%e3%82%81%e3%81%ae%e6%97%a5%e6%9c%ac%e8%aa%9e%e3%82%a4%e3%83%b3%e3%82%b9%e3%83%88%e3%83%a9%e3%82%af%e3%82%b7%e3%83%a7%e3%83%b3%e3%83%87%e3%83%bc%e3%82%bf%e4%bd%9c%e6%88%90/llm%e3%81%ae%e3%81%9f%e3%82%81%e3%81%ae%e6%97%a5%e6%9c%ac%e8%aa%9e%e3%82%a4%e3%83%b3%e3%82%b9%e3%83%88%e3%83%a9%e3%82%af%e3%82%b7%e3%83%a7%e3%83%b3%e3%83%87%e3%83%bc%e3%82%bf-%e5%85%ac%e9%96%8b/… See the full description on the dataset page: https://huggingface.co/datasets/kinokokoro/ichikara-instruction-003.","downloads":35,"tags":["license:cc","region:us"],"createdAt":"2024-05-07T07:31:58.000Z","key":""},{"_id":"663b1cafa55b0634633906c8","id":"sujet-ai/Sujet-Finance-Vision-10k","author":"sujet-ai","disabled":false,"gated":false,"lastModified":"2024-05-12T19:58:33.000Z","likes":18,"trendingScore":1,"private":false,"sha":"c2dda31909572a8d6210355e496ff185e75ba94c","description":"\n\t\n\t\t\n\t\tSujet Finance Vision 10k Dataset\n\t\n\n\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\nThe Sujet Finance Vision 10k dataset is a comprehensive collection of financial document images along with their associated textual annotations. 
This dataset is specifically designed to facilitate the training and evaluation of Vision-Language Models (VLMs) in recognizing and describing various types of financial documents.\n\n\t\n\t\t\n\t\tImage Characteristics\n\t\n\nThe dataset consists of 9819 handpicked images of financial… See the full description on the dataset page: https://huggingface.co/datasets/sujet-ai/Sujet-Finance-Vision-10k.","downloads":181,"tags":["task_categories:image-to-text","language:en","license:apache-2.0","size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","finance"],"createdAt":"2024-05-08T06:33:19.000Z","key":""},{"_id":"663b1e0aca4b108fad836683","id":"ajibawa-2023/Maths-College","author":"ajibawa-2023","disabled":false,"gated":false,"lastModified":"2024-05-08T13:15:09.000Z","likes":53,"trendingScore":1,"private":false,"sha":"9eaa32a8a7bb06ac208dff3650024747e75d7999","description":"Maths-College\nI am releasing a large Mathematics dataset in the instruction format. 
\nThis extensive dataset, comprising nearly one million instructions in JSON format, encapsulates a wide array of mathematical disciplines essential for a profound understanding of the subject.\nThis dataset is very useful to researchers and model developers.\nThe following fields and subfields are covered:\nProbability\nStatistics\nLinear Algebra\nAlgebra\nGroup Theory\nTopology\nAbstract Algebra\nGraph Theory\nCombinatorics… See the full description on the dataset page: https://huggingface.co/datasets/ajibawa-2023/Maths-College.","downloads":103,"tags":["task_categories:text-generation","task_categories:question-answering","language:en","license:apache-2.0","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:dask","library:mlcroissant","doi:10.57967/hf/3166","region:us","Maths","Mathematics","Probability","Statistics","Liner Algebra","Algebra","Group Theory","Topology","Abstract Algebra","Graph Theory","Test Preparations","Combinatorics","Differential Equations","Calculus","Algorithms","Datastructures","Matrix Algebra"],"createdAt":"2024-05-08T06:39:06.000Z","key":""},{"_id":"663e283ee4bc02e697c71e51","id":"eashuu/medical_qa","author":"eashuu","disabled":false,"gated":false,"lastModified":"2024-05-10T14:13:29.000Z","likes":3,"trendingScore":1,"private":false,"sha":"e23ff3d302b7db1409899b71f0f71cc3cceaeb81","description":"\n\t\n\t\t\n\t\tDataset Card for Dataset Name\n\t\n\n\n\nThis dataset card aims to be a base template for new datasets. 
It has been generated using this raw template.\n\n\t\n\t\t\n\t\tDataset Details\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\n\n\n\n\n\nCurated by: [More Information Needed]\nFunded by [optional]: [More Information Needed]\nShared by [optional]: [More Information Needed]\nLanguage(s) (NLP): [More Information Needed]\nLicense: [More Information Needed]\n\n\n\t\n\t\t\n\t\tDataset Sources [optional]\n\t\n\n\n\n\nRepository: [More… See the full description on the dataset page: https://huggingface.co/datasets/eashuu/medical_qa.","downloads":13,"tags":["size_categories:100K<n<1M","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-05-10T13:59:26.000Z","key":""},{"_id":"66426d5d1eb01eb3d86bb39a","id":"ewok-core/ewok-core-1.0","author":"ewok-core","disabled":false,"gated":"auto","lastModified":"2024-05-23T19:48:59.000Z","likes":11,"trendingScore":1,"private":false,"sha":"34d912a608066c92e2990a0328ffc3bd9a716042","description":"\n\t\n\t\t\n\t\tEWoK-core-1.0\n\t\n\nThe repository hosts data from the paper\nElements of World Knowledge (EWoK): A cognition-inspired framework for evaluating basic world knowledge in language models\n\n\t\n\t\t\n\t\tWhat is EWoK?\n\t\n\nThe ability to build and leverage world models is essential for a general-purpose AI agent. \nTesting such capabilities can be hard, in part because the building blocks of world models \nare ill-defined. 
Here, we present Elements of World Knowledge (EWOK), a framework for evaluating… See the full description on the dataset page: https://huggingface.co/datasets/ewok-core/ewok-core-1.0.","downloads":950,"tags":["language:en","license:cc-by-4.0","size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2405.09605","region:us","world knowledge","plausibility","world modeling"],"createdAt":"2024-05-13T19:43:25.000Z","key":""},{"_id":"66430ee73f7c03e9beee9af2","id":"NghiemAbe/ViWikiFC","author":"NghiemAbe","disabled":false,"gated":"manual","lastModified":"2024-05-14T07:20:48.000Z","likes":1,"trendingScore":1,"private":false,"sha":"dc0bdc410a7c55ecf2dd137981d2393c1d453549","description":"\n\t\n\t\t\n\t\tViWikiFC: Fact-Checking for Vietnamese Wikipedia-Based Textual Knowledge Source\n\t\n\nFact-checking is a task that aims to verify the truthfulness of a statement (Claim) based on information from trustworthy sources (Evidence).\nViWikiFC is the first large-scale, open-domain corpus for Vietnamese Fact-Checking on Wikipedia. The corpus consists of 20,916 claims manually annotated and based on evidence retrieved from Wikipedia pages. 
\nOur corpus has three label classes which take… See the full description on the dataset page: https://huggingface.co/datasets/NghiemAbe/ViWikiFC.","downloads":3,"tags":["task_categories:text-retrieval","task_categories:text-classification","language:vi","license:mit","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2405.07615","region:us"],"createdAt":"2024-05-14T07:12:39.000Z","key":""},{"_id":"6645f88d52f167b48fd7133a","id":"Hacker0x01/hackerone_disclosed_reports","author":"Hacker0x01","disabled":false,"gated":false,"lastModified":"2024-05-17T07:10:48.000Z","likes":26,"trendingScore":1,"private":false,"sha":"9620da2e40cac1901f751993eebe128fa356bc77","description":"\n\t\n\t\t\n\t\tHackerOne Disclosed Reports Dataset\n\t\n\n\n\t\n\t\t\n\t\tDataset Card for HackerOne Disclosed Reports\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThis dataset contains all disclosed reports from HackerOne, a leading vulnerability coordination and bug bounty platform. 
Each report includes comprehensive details about discovered security vulnerabilities, such as descriptions, steps to reproduce, and remediation actions.\n\n\t\n\t\t\n\t\tSupported Tasks and Leaderboards\n\t\n\nThis dataset can be used for the following tasks:… See the full description on the dataset page: https://huggingface.co/datasets/Hacker0x01/hackerone_disclosed_reports.","downloads":418,"tags":["region:us"],"createdAt":"2024-05-16T12:14:05.000Z","key":""},{"_id":"66460b07e3725fe93cb0c9f1","id":"Lab-Rasool/TCGA","author":"Lab-Rasool","disabled":false,"gated":false,"lastModified":"2025-07-08T15:15:32.000Z","likes":18,"trendingScore":1,"private":false,"sha":"e15d84df73d5af580e0e277a9536b97038cb8222","description":"\n\t\n\t\t\n\t\tDataset Card for The Cancer Genome Atlas (TCGA) Multimodal Dataset\n\t\n\n\n\nThe Cancer Genome Atlas (TCGA) Multimodal Dataset is a comprehensive collection of clinical data, pathology reports, slide images, molecular data, and radiology images for cancer patients. 
\nThis dataset aims to facilitate research in multimodal machine learning for oncology by providing embeddings generated using state-of-the-art models including GatorTron, MedGemma, Qwen, Llama, UNI, SeNMo, REMEDIS, and… See the full description on the dataset page: https://huggingface.co/datasets/Lab-Rasool/TCGA.","downloads":1448,"tags":["language:en","license:cc-by-nc-nd-4.0","size_categories:100K<n<1M","format:parquet","modality:text","modality:timeseries","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2405.07460","arxiv:2405.08226","region:us","medical","multimodal","tcga","oncology"],"createdAt":"2024-05-16T13:32:55.000Z","key":""},{"_id":"66474ad376c8a1fff681dffa","id":"llamafactory/alpaca_gpt4_zh","author":"llamafactory","disabled":false,"gated":false,"lastModified":"2024-06-07T18:46:07.000Z","likes":20,"trendingScore":1,"private":false,"sha":"065394ee242c43928298de8f43a8748ffd16f3e3","description":"Borrowed from: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM\nRemoved 6,103 mistruncated examples.\nYou can use it in LLaMA Factory by specifying dataset: 
alpaca_gpt4_zh.\n","downloads":2291,"tags":["task_categories:text-generation","task_categories:question-answering","language:zh","license:apache-2.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","llama-factory"],"createdAt":"2024-05-17T12:17:23.000Z","key":""},{"_id":"664a388bb5e5f95dc6e71800","id":"teddylee777/QA-Dataset-mini","author":"teddylee777","disabled":false,"gated":false,"lastModified":"2024-05-26T15:36:42.000Z","likes":5,"trendingScore":1,"private":false,"sha":"112ac542e372b8793b793597bab233f8705c0dbe","downloads":70,"tags":["region:us"],"createdAt":"2024-05-19T17:36:11.000Z","key":""},{"_id":"664b8900533c5d0c9c40367e","id":"Zainabsa99/cyberattack2","author":"Zainabsa99","disabled":false,"gated":false,"lastModified":"2024-05-20T18:37:25.000Z","likes":2,"trendingScore":1,"private":false,"sha":"cc3770c7cb75d3bc07f77918238248cad84b9916","downloads":34,"tags":["size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-05-20T17:31:44.000Z","key":""},{"_id":"665048a98286a86212c5ede9","id":"Kittech/mixed_shona_dataset","author":"Kittech","disabled":false,"gated":false,"lastModified":"2024-05-24T10:40:46.000Z","likes":3,"trendingScore":1,"private":false,"sha":"95e2c9244c513c108b4107bb40ad5f099b14ed4c","downloads":47,"tags":["task_categories:text-generation","task_categories:text-classification","task_categories:automatic-speech-recognition","language:sn","language:en","license:mit","size_categories:n<1K","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","code"],"createdAt":"2024-05-24T07:58:33.000Z","key":""},{"_id":"6655333c3b3f14a35b06f34f","id":"jhu-clsp/CLERC","author":"jhu-clsp","disabled":false,"gated":false,"lastModified":"2024-06-02T14:44:46.000Z","likes":11,"trending
Score":1,"private":false,"sha":"ef042f8ab436f78704f17faa0a866d1b2b862f6f","description":"README in progress\n\n\t\n\t\t\n\t\tUsage\n\t\n\nThe dataset is in folder according to the task and type (e.g. generation or collection for IR).\nYou can load a specific file (say the test set of the generation task) with this command:\nfrom datasets import load_dataset\ndataset = load_dataset(\"jhu-clsp/CLERC\", data_files={\"data\": f\"generation/test.jsonl\"})[\"data\"]\n\nChange the values in data_files to match the file you want to load.\n","downloads":381,"tags":["task_categories:text-generation","language:en","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","report-generation","information retrieval","retrieval","generation","legal","law"],"createdAt":"2024-05-28T01:28:28.000Z","key":""},{"_id":"6656cd64b7e6a9f0955f0a09","id":"CardinalOperations/IndustryOR","author":"CardinalOperations","disabled":false,"gated":false,"lastModified":"2026-05-03T10:05:22.000Z","likes":27,"trendingScore":1,"private":false,"sha":"b00811923f167cdf4ae1677c33cd61d4eea78f48","description":"\n\t\n\t\t\n\t\tOverview\n\t\n\nIndustryOR, the first industrial benchmark, consists of 100 real-world OR problems. 
It covers 5 types of questions—linear programming, integer programming, mixed integer programming, non-linear programming, and others—across 3 levels of difficulty.\n\n\t\n\t\t\n\t\tCitation\n\t\n\n@article{tang2024orlm,\n  title={ORLM: Training Large Language Models for Optimization Modeling},\n  author={Tang, Zhengyang and Huang, Chenyu and Zheng, Xin and Hu, Shixi and Wang, Zizhuo and Ge, Dongdong and… See the full description on the dataset page: https://huggingface.co/datasets/CardinalOperations/IndustryOR.","downloads":341,"tags":["language:en","license:cc-by-nc-4.0","size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2405.17743","region:us"],"createdAt":"2024-05-29T06:38:28.000Z","key":""},{"_id":"665c2dfe3c0e54bafaa35eaf","id":"khoipd/Violence","author":"khoipd","disabled":false,"gated":false,"lastModified":"2024-06-11T07:01:12.000Z","likes":1,"trendingScore":1,"private":false,"sha":"b3632558b31f66374d187cc0980afbd193f1709c","description":"\n\t\n\t\t\n\t\tDataset Card for Dataset Name\n\t\n\n\n\nThis dataset card aims to be a base template for new datasets. 
It has been generated using this raw template.\n\n\t\n\t\t\n\t\tDataset Details\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\n\n\n\n\n\nCurated by: [More Information Needed]\nFunded by [optional]: [More Information Needed]\nShared by [optional]: [More Information Needed]\nLanguage(s) (NLP): [More Information Needed]\nLicense: [More Information Needed]\n\n\n\t\n\t\t\n\t\tDataset Sources [optional]\n\t\n\n\n\n\nRepository: [More… See the full description on the dataset page: https://huggingface.co/datasets/khoipd/Violence.","downloads":23,"tags":["size_categories:1K<n<10K","modality:video","library:datasets","library:mlcroissant","region:us"],"createdAt":"2024-06-02T08:31:58.000Z","key":""},{"_id":"665e2bb3fbaa279db369a409","id":"apple/DataCompDR-12M","author":"apple","disabled":false,"gated":false,"lastModified":"2026-04-20T23:02:17.000Z","likes":36,"trendingScore":1,"private":false,"sha":"adb713ff04db6a7f5792a929c2e3547ccb085099","description":"\n\t\n\t\t\n\t\tDataset Card for DataCompDR-12M\n\t\n\n\n\nThis dataset contains synthetic captions, embeddings, and metadata for DataCompDR-12M.\nThe metadata has been generated using pretrained image-text models on a 12M subset of DataComp-1B.\nFor details on how to use the metadata, please visit our github repository.\nThe dataset with the original captions is now available at mlfoundations/DataComp-12M.\nThe UIDs per shards match between mlfoundations/DataComp-12M and apple/DataCompDR-12M.\n\n\t\n\t\n\t\n\t\tDataset… See the full description on the dataset page: 
https://huggingface.co/datasets/apple/DataCompDR-12M.","downloads":2167,"tags":["task_categories:text-to-image","task_categories:image-to-text","language:en","license:apple-amlr","size_categories:10M<n<100M","format:webdataset","modality:image","modality:text","library:datasets","library:webdataset","library:mlcroissant","arxiv:2311.17049","region:us"],"createdAt":"2024-06-03T20:46:43.000Z","key":""},{"_id":"665e64e91ab07f22d245ec9c","id":"saillab/alpaca_shona_taco","author":"saillab","disabled":false,"gated":false,"lastModified":"2024-09-20T22:08:59.000Z","likes":5,"trendingScore":1,"private":false,"sha":"27bacc71ba8450ad151dba6b98918c65f244cac7","description":"This repository contains the dataset used for the TaCo paper.\nThe dataset follows the style outlined in the TaCo paper, as follows:\n{\n\"instruction\": \"instruction in xx\",\n\"input\": \"input in xx\",\n\"output\": \"Instruction in English: instruction in en , \n            Response in English: response in en ,\n            Response in xx: response in xx \"\n}\n\nPlease refer to the paper for more details: OpenReview\nIf you have used our dataset, please cite it as follows:\nCitation… See the full description on the dataset page: https://huggingface.co/datasets/saillab/alpaca_shona_taco.","downloads":27,"tags":["language:sn","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-06-04T00:50:49.000Z","key":""},{"_id":"665eeb1bd2f2de255e224349","id":"OOPPEENN/OnlineGame_Dataset","author":"OOPPEENN","disabled":false,"gated":"auto","lastModified":"2025-10-10T08:05:57.000Z","likes":19,"trendingScore":1,"private":false,"sha":"34d8bf3a692efdbc462dea9a1392ea804ac110ce","description":"\n\t\n\t\t\n\t\t0x0 Terms of use:\n\t\n\n\n[!IMPORTANT]Commercial use is prohibited. Neither this dataset nor any model trained on it may be used for any commercial activity. For commercial use, please obtain authorization from the companies in the list (lol). I bear no responsibility for any problems arising from violations of the open-source license!\n\n\n\t\n\t\t\n\t\t0x1 
Data notes:\n\t\n\nAnnotation notes: the annotations, speakers, and corresponding audio were generated by reading the game engine scripts directly, so they should be 100% accurate. Everything is stored in index.json. If errors remain, you can report them by opening an issue (some leftover control characters may not have been fully cleaned).\nAlways locate audio through the key-value pairs in index.json. Discard any audio that is not in the index, and discard entries whose speaker is ？？？, ？？, or similar.\nData languages: Simplified Chinese, Japanese, English, Korean\n\n\t\n\t\t\n\t\t0x2 Download instructions:\n\t\n\nhf download --repo-type dataset OOPPEENN/OnlineGame_Dataset --local-dir OnlineGame_Dataset\n\nDo not clone the repository directly with git; git lfs will take up twice the space!\n\n\t\n\t\t\n\t\t0x3 Other:… See the full description on the dataset page: https://huggingface.co/datasets/OOPPEENN/OnlineGame_Dataset.","downloads":27,"tags":["license:gpl-3.0","region:us"],"createdAt":"2024-06-04T10:23:23.000Z","key":""},{"_id":"66609ca86fcdbe4a6b7fd282","id":"oscar-corpus/mOSCAR","author":"oscar-corpus","disabled":false,"gated":false,"lastModified":"2024-11-23T10:09:43.000Z","likes":18,"trendingScore":1,"private":false,"sha":"a2d98c4103c6dbd05f4c6f02c33376a9942ccb20","description":"More info can be found here: https://oscar-project.github.io/documentation/versions/mOSCAR/\nPaper link: https://arxiv.org/abs/2406.08707\nNew features:\n\nAdditional filtering steps were applied to remove toxic content (more details in the next version of the paper, coming soon).\nSpanish split is now complete.\nFace detection in images to blur them once downloaded (coordinates are reported on images of size 256 respecting aspect ratio).\nAdditional language identification of the documents to… See the full description on the dataset page: 
https://huggingface.co/datasets/oscar-corpus/mOSCAR.","downloads":756,"tags":["license:cc-by-4.0","size_categories:100M<n<1B","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2406.08707","region:us"],"createdAt":"2024-06-05T17:13:12.000Z","key":""},{"_id":"6660d3ce346ffbe5a37a61f0","id":"fol-traces/fol-traces","author":"fol-traces","disabled":false,"gated":false,"lastModified":"2026-06-12T20:32:36.000Z","likes":3,"trendingScore":1,"private":false,"sha":"e0ae2fbfa163a29f8545db75b748555c2bc276e8","description":"\n\t\n\t\t\n\t\n\t\n\t\tcitation\n\t\n\n  @misc{lee2025foltraces,\n    title={FOL-Traces: Verified First-Order Logic Reasoning Traces at Scale},\n    author={Lee, Isabelle and Liaw, Sarah and Yogatama, Dani},\n    year={2025},\n    eprint={2505.14932},\n    archivePrefix={arXiv},\n    primaryClass={cs.AI},\n    url={https://arxiv.org/abs/2505.14932}\n  }\n\n","downloads":995,"tags":["task_categories:text-generation","language:en","license:mit","size_categories:1B<n<10B","arxiv:2505.14932","region:us","logic","first-order-logic","fol"],"createdAt":"2024-06-05T21:08:30.000Z","key":""},{"_id":"66616c59189e16cefc5ba031","id":"AgentGym/AgentTraj-L","author":"AgentGym","disabled":false,"gated":false,"lastModified":"2025-09-02T13:11:20.000Z","likes":25,"trendingScore":1,"private":false,"sha":"cb518f5359f0b635a6b8c0c5cbf0469939a927c7","downloads":285,"tags":["size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:dask","library:mlcroissant","region:us"],"createdAt":"2024-06-06T07:59:21.000Z","key":""},{"_id":"6661b30f814eb473e01b2243","id":"xiang709/VRSBench","author":"xiang709","disabled":false,"gated":false,"lastModified":"2026-06-11T16:54:14.000Z","likes":17,"trendingScore":1,"private":false,"sha":"6cee2968fd752a6d51c6cb2d18dded2bc0baa218","description":"\n\t\n\t\t\n\t\n\t\n\t\tVRSBench\n\t\n\n\n    \n\n\nVRSBench is a Versatile Vision-Language Benchmark for Remote 
Sensing Image Understanding. It consists of 29,614 remote sensing images with detailed captions, 52,472 object references, and 123,221 visual question-answer pairs. It facilitates the training and evaluation of vision-language models across a broad spectrum of remote sensing image understanding tasks. \n\n\t\n\t\t\n\t\n\t\n\t\tUsing datasets\n\t\n\nfrom datasets import load_dataset\nfw = load_dataset(\"xiang709/VRSBench\"… See the full description on the dataset page: https://huggingface.co/datasets/xiang709/VRSBench.","downloads":2666,"tags":["task_categories:visual-question-answering","task_categories:text-generation","language:en","license:cc-by-4.0","size_categories:10K<n<100K","region:us","remote sensing, vision-language models"],"createdAt":"2024-06-06T13:01:03.000Z","key":""},{"_id":"666393b9f7aa5265a21b34bb","id":"xlangai/BRIGHT","author":"xlangai","disabled":false,"gated":false,"lastModified":"2025-03-01T16:51:21.000Z","likes":73,"trendingScore":1,"private":false,"sha":"3066d29c9651a576c8aba4832d249807b181ecae","description":"\n\t\n\t\t\n\t\tBRIGHT benchmark\n\t\n\nBRIGHT is the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents. 
\nThe queries are collected from diverse domains (StackExchange, LeetCode, and math competitions), all sourced from realistic human data.\nExperiments show that existing retrieval models perform poorly on BRIGHT, where the highest score is only 22.1 measured by nDCG@10.\nBRIGHT provides a good testbed for future retrieval research in more realistic and… See the full description on the dataset page: https://huggingface.co/datasets/xlangai/BRIGHT.","downloads":12812,"tags":["task_categories:text-retrieval","language:en","license:cc-by-4.0","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2407.12883","region:us","text-retrieval","code","biology","earth_science","economics","psychology","robotics","math"],"createdAt":"2024-06-07T23:11:53.000Z","key":""},{"_id":"66639a15f4b20b92cac90015","id":"Unispac/shallow-vs-deep-safety-alignment-dataset","author":"Unispac","disabled":false,"gated":"auto","lastModified":"2025-04-23T03:49:36.000Z","likes":1,"trendingScore":1,"private":false,"sha":"861f7d5ab930230b1f3856e63af18b839a38bea3","description":"\n\t\n\t\t\n\t\tLicense Agreement\n\t\n\nThis dataset contains the derivatives of the LLM-Tuning-Safety/HEx-PHI, and therefore the usage of this dataset should follow the license agreement of hexphi.\nBelow is a duplicate of the license's terms and conditions.\nThis Agreement contains the terms and conditions that govern your access and use of the HEx-PHI Dataset (as defined above). You may not use the HEx-PHI Dataset if you do not accept this Agreement. 
By clicking to accept, accessing the HEx-PHI Dataset… See the full description on the dataset page: https://huggingface.co/datasets/Unispac/shallow-vs-deep-safety-alignment-dataset.","downloads":82,"tags":["license:other","modality:text","region:us"],"createdAt":"2024-06-07T23:39:01.000Z","key":""},{"_id":"666513f121aa69e38699e6d3","id":"UCSC-VLAA/MedTrinity-25M","author":"UCSC-VLAA","disabled":false,"gated":"auto","lastModified":"2024-10-11T00:47:43.000Z","likes":211,"trendingScore":1,"private":false,"sha":"89e5c684794e5c4cc1af9e8f1a7798af7c937dbf","description":"\n\t\n\t\t\n\t\tTutorial of using Medtrinity-25M\n\t\n\nMedTrinity-25M, a comprehensive, large-scale multimodal dataset for medicine, covering over 25 million images across 10 modalities, with multigranular annotations for more than 65 diseases. These enriched annotations encompass both global textual information, such as disease/lesion type, modality, region-specific descriptions, and inter-regional relationships, as well as detailed local annotations for regions of interest (ROIs), including bounding… See the full description on the dataset page: https://huggingface.co/datasets/UCSC-VLAA/MedTrinity-25M.","downloads":1955,"tags":["task_categories:question-answering","language:en","size_categories:10M<n<100M","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2408.02900","region:us","medical"],"createdAt":"2024-06-09T02:31:13.000Z","key":""},{"_id":"6665577ada1e6adefec9f16a","id":"Yueha0/FoodDialogues","author":"Yueha0","disabled":false,"gated":false,"lastModified":"2024-06-18T09:43:18.000Z","likes":5,"trendingScore":1,"private":false,"sha":"ed22fa957e4ff1b0796471340cf3876579232304","description":"\n\t\n\t\t\n\t\tFoodDialogues\n\t\n\nFoodDialogues is built from the Nutrition5k dataset, \nwhich contains ingredient labels and precise nutrition information, making it unique and suitable for various conversational topics. 
\nSpecifically, we follow the training and testing splits of the original data set and selected an overhead RGB image and a well-angled (angle A or D) video frame for each sample. \nSend the sample's ingredient list and detailed nutritional information to GPT-4 in the form of plain text… See the full description on the dataset page: https://huggingface.co/datasets/Yueha0/FoodDialogues.","downloads":163,"tags":["license:apache-2.0","arxiv:2312.14991","region:us"],"createdAt":"2024-06-09T07:19:22.000Z","key":""},{"_id":"6665c0cf3329cf66adec2f8d","id":"nothingiisreal/Human_Stories","author":"nothingiisreal","disabled":false,"gated":false,"lastModified":"2024-06-09T15:02:12.000Z","likes":9,"trendingScore":1,"private":false,"sha":"076b24067faf32f5b2ac1d2e3333d2b8ea9f6f4c","description":"We took this dataset [https://huggingface.co/datasets/Gryphe/Opus-WritingPrompts] and then downloaded the entire subreddit, made a script to search through it and compiled a new dataset that has human writings in contrast to AI.\nWe used this dataset to train a classifier which has 95% accuracy you can find here [https://huggingface.co/nothingiisreal/open-gpt-3.5-detector]\nHowever, the main goal of this is instead to remove the watermarks imposed by OpenAI and other AI companies including… See the full description on the dataset page: https://huggingface.co/datasets/nothingiisreal/Human_Stories.","downloads":65,"tags":["license:other","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-06-09T14:48:47.000Z","key":""},{"_id":"6665de52b51b5864406f7cf5","id":"gretelai/synthetic_pii_finance_multilingual","author":"gretelai","disabled":false,"gated":false,"lastModified":"2024-06-11T03:00:20.000Z","likes":77,"trendingScore":1,"private":false,"sha":"7b844d16738527a04264f50214cb426a4cea0897","description":"\n  \n  Image generated by DALL-E. 
See prompt for more details\n\n\n\n\t\n\t\t\n\t\t💼 📊 Synthetic Financial Domain Documents with PII Labels\n\t\n\ngretelai/synthetic_pii_finance_multilingual is a dataset of full length synthetic financial documents containing Personally Identifiable Information (PII), generated using Gretel Navigator and released under Apache 2.0.\nThis dataset is designed to assist with the following use cases:\n\n🏷️ Training NER (Named Entity Recognition) models to detect and label PII in… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual.","downloads":940,"tags":["task_categories:text-classification","task_categories:fill-mask","task_categories:token-classification","language:en","language:fr","language:de","language:nl","language:es","language:it","license:apache-2.0","size_categories:10K<n<100K","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2306.05685","region:us","synthetic","PII","finance","full-documents"],"createdAt":"2024-06-09T16:54:42.000Z","key":""},{"_id":"666866f02ae95a428c6da904","id":"nkp37/OpenVid-1M","author":"nkp37","disabled":false,"gated":false,"lastModified":"2026-03-31T03:50:30.000Z","likes":272,"trendingScore":1,"private":false,"sha":"d8a63bd22989c80b5734ec2bb989f4e1b61a5807","description":"\n  \n\n\n\n\t\n\t\t\n\t\tSummary\n\t\n\nThis is the dataset proposed in our paper [ICLR 2025] OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation.\nOpenVid-1M is a high-quality text-to-video dataset designed for research institutions to enhance video quality, featuring high aesthetics, clarity, and resolution. 
It can be used for direct training or as a quality tuning complement to other video datasets.\nAll videos in the OpenVid-1M dataset have resolutions of at least 512×512.… See the full description on the dataset page: https://huggingface.co/datasets/nkp37/OpenVid-1M.","downloads":70372,"tags":["task_categories:text-to-video","task_categories:image-to-video","language:en","license:cc-by-4.0","size_categories:1M<n<10M","format:csv","modality:tabular","modality:text","modality:video","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2407.02371","region:us","text-to-video","image-to-video","Video Generative Model Training","Text-to-Video Diffusion Model Training","prompts","Video Understanding"],"createdAt":"2024-06-11T15:02:08.000Z","key":""},{"_id":"6668f5db07b1c1b5792f455a","id":"appier-ai-research/StreamBench","author":"appier-ai-research","disabled":false,"gated":false,"lastModified":"2024-08-14T04:57:30.000Z","likes":6,"trendingScore":1,"private":false,"sha":"a387d5d86c4fab1b7cff724a3cdea0299a716b91","description":"\nStreamBench paper link: https://arxiv.org/abs/2406.08747 (The links for the original raw datasets on StreamBench can be found in Appendix F)\nIf you find our work helpful, please cite as:\n\n@article{wu2024streambench,\n  title={StreamBench: Towards Benchmarking Continuous Improvement of Language Agents},\n  author={Wu, Cheng-Kuang and Tam, Zhi Rui and Lin, Chieh-Yen and Chen, Yun-Nung and Lee, Hung-yi},\n  journal={arXiv preprint arXiv:2406.08747},\n  
year={2024}\n}\n\n","downloads":348,"tags":["size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2406.08747","region:us"],"createdAt":"2024-06-12T01:11:55.000Z","key":""},{"_id":"66699b339a7d2f421793e198","id":"JailbreakBench/JBB-Behaviors","author":"JailbreakBench","disabled":false,"gated":false,"lastModified":"2024-09-26T11:05:44.000Z","likes":108,"trendingScore":1,"private":false,"sha":"886acc352a31533ffbcf4ef22c744658688086fc","description":"\n  \n\n\n\n    An Open Robustness Benchmark for Jailbreaking Language Models\n    \n\n\n\n    NeurIPS 2024 Datasets and Benchmarks Track\n    \n\n\n\n    Paper |\n    Leaderboard |\n    Benchmark code\n\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tWhat is JailbreakBench?\n\t\n\nJailbreakbench is an open-source robustness benchmark for jailbreaking large language models (LLMs). The goal of this benchmark is to comprehensively track progress toward (1) generating successful jailbreaks and (2) defending against these jailbreaks. 
To this end, we… See the full description on the dataset page: https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors.","downloads":19087,"tags":["language:en","license:mit","size_categories:n<1K","format:csv","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2404.01318","arxiv:2311.03348","arxiv:2307.15043","arxiv:2402.04249","doi:10.57967/hf/2540","region:us","jailbreaks","large language models","harmful behaviors","ml safety"],"createdAt":"2024-06-12T12:57:23.000Z","key":""},{"_id":"6669b73c38fa6af7a7e4d1e8","id":"Laz4rz/wikipedia_science_chunked_small_rag_512","author":"Laz4rz","disabled":false,"gated":false,"lastModified":"2024-06-12T15:57:20.000Z","likes":4,"trendingScore":1,"private":false,"sha":"d705d766d7fefe604c119388364c338b58242025","description":"\n\t\n\t\t\n\t\tScienceWikiSmallChunk\n\t\n\nProcessed version of millawell/wikipedia_field_of_science, prepared to be used in small context length RAG systems. Chunk length is tokenizer dependent, but each chunk should be around 512 tokens. 
Longer wikipedia pages have been split into smaller entries, with title added as a prefix.\nThere is also 256 tokens dataset available: Laz4rz/wikipedia_science_chunked_small_rag_256\nIf you wish to prepare some other chunk length:\n\nuse… See the full description on the dataset page: https://huggingface.co/datasets/Laz4rz/wikipedia_science_chunked_small_rag_512.","downloads":59,"tags":["task_categories:text-generation","task_categories:text-classification","task_categories:question-answering","language:en","license:cc-by-sa-3.0","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","RAG","Retrieval Augmented Generation","Small Chunks","Wikipedia","Science","Scientific","Scientific Wikipedia","Science Wikipedia","512 tokens"],"createdAt":"2024-06-12T14:57:00.000Z","key":""},{"_id":"6669c3b09cba11bd96dfd9a5","id":"Laz4rz/wikipedia_science_chunked_small_rag_256","author":"Laz4rz","disabled":false,"gated":false,"lastModified":"2024-06-12T15:57:16.000Z","likes":3,"trendingScore":1,"private":false,"sha":"4790153b18e174267c8fa472bb30dee5d47284ae","description":"\n\t\n\t\t\n\t\tScienceWikiSmallChunk\n\t\n\nProcessed version of millawell/wikipedia_field_of_science, prepared to be used in small context length RAG systems. Chunk length is tokenizer dependent, but each chunk should be around 256 tokens. 
Longer wikipedia pages have been split into smaller entries, with title added as a prefix.\nThere is also 512 tokens dataset available: Laz4rz/wikipedia_science_chunked_small_rag_512\nIf you wish to prepare some other chunk length:\n\nuse… See the full description on the dataset page: https://huggingface.co/datasets/Laz4rz/wikipedia_science_chunked_small_rag_256.","downloads":38,"tags":["task_categories:text-generation","task_categories:text-classification","task_categories:question-answering","language:en","license:cc-by-sa-3.0","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","RAG","Retrieval Augmented Generation","Small Chunks","Wikipedia","Science","Scientific","Scientific Wikipedia","Science Wikipedia","256 tokens"],"createdAt":"2024-06-12T15:50:08.000Z","key":""},{"_id":"6669dce1a5e016305f936c8e","id":"sorry-bench/sorry-bench-202406","author":"sorry-bench","disabled":false,"gated":"auto","lastModified":"2024-07-02T19:55:07.000Z","likes":22,"trendingScore":1,"private":false,"sha":"b34822276edde97592eda99c0b56d306f8830469","description":"\n\t\n\t\t\n\t\tDataset Card for SORRY-Bench Dataset (2024/06)\n\t\n\n\n\n\n  🏠Website \n\n\n  📑Paper \n\n\n  📚Dataset \n\n\n  💻Github \n\n\n  🧑‍⚖️Human Judgment Dataset \n\n\n  🤖Judge LLM \n\n\n\n\nThis dataset contains 9.5K potentially unsafe instructions, intended to be used for LLM safety refusal evaluation.\nParticularly, our base dataset consists of 450 unsafe instructions in total, spanning across 45 finegrained safety categories (10 data points per category).\nThe dataset we present here equally captures risks from… See the full description on the dataset page: 
https://huggingface.co/datasets/sorry-bench/sorry-bench-202406.","downloads":1446,"tags":["task_categories:text-generation","task_categories:question-answering","language:en","language:zh","language:fr","language:ml","language:mr","language:ta","license:other","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2406.14598","region:us","croissant","safety"],"createdAt":"2024-06-12T17:37:37.000Z","key":""},{"_id":"6669ed3bcf27eed0f2ecc297","id":"namkoong-lab/PersonalLLM","author":"namkoong-lab","disabled":false,"gated":false,"lastModified":"2025-02-25T15:14:25.000Z","likes":18,"trendingScore":1,"private":false,"sha":"643192faf99effc9403ce80dc5a38a016b6f7247","description":"\n\t\n\t\t\n\t\tDataset Card for PersonalLLM\n\t\n\n\n\nThe PersonalLLM dataset is a collection of prompts, responses, and rewards designed for personalized language model methodology development and evaluation.  This dataset is presented in the paper PersonalLLM: Tailoring LLMs to Individual Preferences.\n\n\t\n\t\t\n\t\tDataset Details\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\n\n\nCurated by: Andrew Siah*, Tom Zollo*, Naimeng Ye, Ang Li, Namkoong Hongseok\nFunded by: Digital Future Initiative at Columbia Business School… See the full description on the dataset page: 
https://huggingface.co/datasets/namkoong-lab/PersonalLLM.","downloads":395,"tags":["task_categories:text-generation","language:en","license:cc-by-4.0","size_categories:10K<n<100K","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2409.20296","doi:10.57967/hf/2509","region:us"],"createdAt":"2024-06-12T18:47:23.000Z","key":""},{"_id":"666a096ee0ecb2edac2cdc1f","id":"roman-bushuiev/MassSpecGym","author":"roman-bushuiev","disabled":false,"gated":false,"lastModified":"2026-06-19T15:39:57.000Z","likes":22,"trendingScore":1,"private":false,"sha":"c9aa3feb5f6ec0adee56cc78d2dce24826356156","description":"\n  \n\n\nMassSpecGym provides a dataset and benchmark for the discovery and identification of new molecules from tandem mass spectrometry (MS/MS) spectra. The provided challenges abstract the process of scientific discovery of new molecules from biological and environmental samples into well-defined machine learning problems.\n\n\t\n\t\t\n\t\n\t\n\t\tPapers\n\t\n\n\nMassSpecGym in the Wild: Uncovering and Correcting Evaluation Pitfalls in AI-Driven Molecule Discovery (2025): Paper Link\nMassSpecGym: A benchmark for… See the full description on the dataset page: https://huggingface.co/datasets/roman-bushuiev/MassSpecGym.","downloads":1317,"tags":["task_categories:other","license:mit","size_categories:100K<n<1M","format:csv","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2606.19624","arxiv:2410.23326","region:us","chemistry","mass-spectrometry","molecule-discovery"],"createdAt":"2024-06-12T20:47:42.000Z","key":""},{"_id":"666a0a380abf7b9054b437ae","id":"allenai/discoverybench","author":"allenai","disabled":false,"gated":false,"lastModified":"2025-05-10T09:18:10.000Z","likes":18,"trendingScore":1,"private":false,"sha":"e54ec033049d3a0fd95d3c746919cc8c01c25781","description":"Data-driven Discovery Benchmark from the 
paper:\n\"DiscoveryBench: Towards Data-Driven Discovery with Large Language Models\"\n\n  \n    \n  \n\n\n\n\t\n\t\t\n\t\t🔭 Overview\n\t\n\nDiscoveryBench is designed to systematically assess current model capabilities in data-driven discovery tasks and provide a useful resource for improving them. Each DiscoveryBench task consists of a goal and dataset(s). Solving the task requires both statistical analysis and semantic reasoning. A faceted evaluation allows open-ended… See the full description on the dataset page: https://huggingface.co/datasets/allenai/discoverybench.","downloads":1076,"tags":["task_categories:text-generation","license:odc-by","size_categories:n<1K","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-06-12T20:51:04.000Z","key":""},{"_id":"666ab4d74a6c703a09f0c091","id":"MultimodalUniverse/jwst","author":"MultimodalUniverse","disabled":false,"gated":false,"lastModified":"2024-12-06T01:34:21.000Z","likes":4,"trendingScore":1,"private":false,"sha":"4d3993086970725dba6248eba3d8a04e95efe3a3","description":"---\ndescription: 'Image dataset based on a combination of JWST deep fields from DJA: CEERS,\n\n  NGDEEP, JADES, PRIMER\n  '\nhomepage: https://dawn-cph.github.io/dja/index.html\nversion: 1.1.0\ncitation: \"% % ACKNOWLEDGEMENTS\\n% % From: https://dawn-cph.github.io/dja/index.html\\n\\\n  % We kindly request all scientific papers based on data or products downloaded from  \\ the Dawn JWST Archive (DJA) to include the following acknowledgement:\\n% \\n% (Some  \\ of) The data products presented herein were… See the full description on the dataset page: 
https://huggingface.co/datasets/MultimodalUniverse/jwst.","downloads":143,"tags":["size_categories:10K<n<100K","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2302.10936","arxiv:2302.05466","arxiv:2211.02495","arxiv:2306.02465","region:us"],"createdAt":"2024-06-13T08:59:03.000Z","key":""},{"_id":"666e1d22a119281ee088a3da","id":"c3rl/IIIT-INDIC-HW-WORDS-Hindi","author":"c3rl","disabled":false,"gated":false,"lastModified":"2024-06-21T18:58:45.000Z","likes":4,"trendingScore":1,"private":false,"sha":"2a27244ff5f5f5eaaf86aa4b9411beb356921f51","description":"\n\t\n\t\t\n\t\tIIIT-INDIC-HW-WORDS-Hindi\n\t\n\nDataset containing images of hand written words in Devanagari by various humans and the corresponding text of those images.\n\n\t\n\t\t\n\t\tOverview\n\t\n\nThe dataset, originally developed by the Centre for Visual Information Technology (CVIT) at IIIT Hyderabad, has been transformed into Parquet format to facilitate its use in modern machine learning workflows. 
This dataset primarily targets recognition of handwritten Hindi words and aims to advance research and… See the full description on the dataset page: https://huggingface.co/datasets/c3rl/IIIT-INDIC-HW-WORDS-Hindi.","downloads":190,"tags":["task_categories:image-to-text","task_categories:image-classification","task_categories:image-to-image","language:hi","size_categories:10K<n<100K","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-06-15T23:00:50.000Z","key":""},{"_id":"666eaad1cc1f54dfa9b7f1ea","id":"seyam2023/bangladesh_law","author":"seyam2023","disabled":false,"gated":false,"lastModified":"2024-06-16T09:33:28.000Z","likes":2,"trendingScore":1,"private":false,"sha":"2e68fcfe78744059085042c746abe71933698d68","downloads":26,"tags":["license:apache-2.0","region:us"],"createdAt":"2024-06-16T09:05:21.000Z","key":""},{"_id":"667001ddb6783a0f36cb7879","id":"axiong/imagenet-r","author":"axiong","disabled":false,"gated":false,"lastModified":"2024-06-19T06:50:57.000Z","likes":2,"trendingScore":1,"private":false,"sha":"762a23f169e080bef0be7c9d96bf73a126cbb660","description":"\n\t\n\t\t\n\t\tImageNet-R\n\t\n\nThis repo is made to facilitate the evaluation of various pretraining models. 
It's constructed from the source file provided by official implementation.\n\n\t\n\t\t\n\t\tUsage\n\t\n\nfrom datasets import load_dataset\n\ndataset = load_dataset('axiong/imagenet-r')\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nImageNet-R(endition) contains art, cartoons, deviantart, graffiti, embroidery, graphics, origami, paintings, patterns, plastic objects, plush objects, sculptures, sketches, tattoos, toys, and video… See the full description on the dataset page: https://huggingface.co/datasets/axiong/imagenet-r.","downloads":2195,"tags":["license:mit","size_categories:10K<n<100K","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-06-17T09:29:01.000Z","key":""},{"_id":"667006e1f55666f96737c46c","id":"astergiou/LAVIB","author":"astergiou","disabled":false,"gated":false,"lastModified":"2024-11-01T15:44:05.000Z","likes":4,"trendingScore":1,"private":false,"sha":"10b1be6f40d0b00f3817ee965d23950a206d3e42","description":"Data for LAVIB: A Large-scale Video Interpolation Benchmark (arxiv link: arxiv.org/abs/2406.09754)\n","downloads":881,"tags":["license:cc-by-nc-sa-4.0","size_categories:1M<n<10M","format:csv","modality:tabular","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2406.09754","region:us"],"createdAt":"2024-06-17T09:50:25.000Z","key":""},{"_id":"6670b0abc300b0c8f042a4f5","id":"shreyanmitra/QAWithoutRAGLLMPrompts","author":"shreyanmitra","disabled":false,"gated":false,"lastModified":"2024-07-02T22:20:58.000Z","likes":1,"trendingScore":1,"private":false,"sha":"b84f0dda3014f3397aa3fd9b3b076f3442741440","description":"\n\t\n\t\t\n\t\tDataset Card for QAWithoutRAGLLMPrompts\n\t\n\n\n\nA cleaned and consolidated set of questions (without context) and answers for LLM hallucination detection. 
Each question-answer pair is not the work of the author, but was selected from one of the following datasets:\n\nTruthful QA (Multiple Choice)\nTruthful QA (Text Generation)\nTrivia QA\nARC\n\nIf you use any of the data provided, please cite the sources above in addition to the following paper:\n Shreyan Mitra and Leilani Gilpin. Detecting LLM… See the full description on the dataset page: https://huggingface.co/datasets/shreyanmitra/QAWithoutRAGLLMPrompts.","downloads":16,"tags":["task_categories:text-classification","task_categories:text-generation","task_categories:question-answering","language:en","size_categories:100K<n<1M","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-06-17T21:54:51.000Z","key":""},{"_id":"667551a1b3882fd587c7eb25","id":"FarmerlineML/somali_cleaned_dataset","author":"FarmerlineML","disabled":false,"gated":false,"lastModified":"2024-06-21T10:31:39.000Z","likes":2,"trendingScore":1,"private":false,"sha":"9fa74fb4e87a886d3a86799f172a22ec528abaeb","downloads":52,"tags":["size_categories:1K<n<10K","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-06-21T10:10:41.000Z","key":""},{"_id":"66755d9d9f2810b0096ac389","id":"hf-audio/open-asr-leaderboard","author":"hf-audio","disabled":false,"gated":false,"lastModified":"2026-06-26T13:12:28.000Z","likes":42,"trendingScore":1,"private":false,"sha":"b6bdcd0beb34f8975dc659796176d88f43aff502","description":"\n\t\n\t\t\n\t\n\t\n\t\tESB Test Sets: Parquet & Sorted\n\t\n\nThis dataset takes the open-asr-leaderboard/datasets-test-only data and sorts each split by audio length. 
\nThe format is also changed, from custom loading script (un-safe remote code) to parquet (safe).\nBroadly speaking, this dataset was generated with the following code-snippet:\nfrom datasets import load_dataset, get_dataset_config_names\n\nDATASET = \"open-asr-leaderboard/datasets-test-only\"  # dataset to load from\nHUB_DATASET_ID =… See the full description on the dataset page: https://huggingface.co/datasets/hf-audio/open-asr-leaderboard.","downloads":20434,"tags":["benchmark:official","benchmark:eval-yaml","size_categories:100K<n<1M","modality:audio","modality:text","arxiv:2510.06961","region:us"],"createdAt":"2024-06-21T11:01:49.000Z","key":""},{"_id":"6677ce2a20ff491d3822fd59","id":"rexoscare/autocomplete-search-dataset","author":"rexoscare","disabled":false,"gated":false,"lastModified":"2024-06-23T07:27:03.000Z","likes":1,"trendingScore":1,"private":false,"sha":"16fb03565094160eb059ef55f60fa820cb6f3eb4","downloads":75,"tags":["license:apache-2.0","size_categories:10K<n<100K","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-06-23T07:26:34.000Z","key":""},{"_id":"667953a82c09cbf312b62a1e","id":"irlspbru/RusLawOD","author":"irlspbru","disabled":false,"gated":false,"lastModified":"2026-04-29T07:30:41.000Z","likes":18,"trendingScore":1,"private":false,"sha":"f850b966648499d7ff4f4bc3ef2cddb68f4ec3c0","description":"\n\t\n\t\t\n\t\tThe Russian Legislative Corpus, 1991–2025\n\t\n\nRussian primary and secondary legislation corpus covering laws of Russian Federation, decrees by the President of RF, regulations by the government published as of December, 31, 2025. The corpus collects all 304,382 texts (194,425,905 tokens) of non-secret federal regulations and acts, along with their metadata. 
The corpus has two versions: the original text with minimal preprocessing and a version prepared for linguistic analysis with… See the full description on the dataset page: https://huggingface.co/datasets/irlspbru/RusLawOD.","downloads":881,"tags":["language:ru","license:cc-by-nc-4.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2406.04855","region:us","legal","corpus"],"createdAt":"2024-06-24T11:08:24.000Z","key":""},{"_id":"667ab99edb56acf219d8d646","id":"FreedomIntelligence/PubMedVision","author":"FreedomIntelligence","disabled":false,"gated":false,"lastModified":"2025-02-18T07:44:10.000Z","likes":106,"trendingScore":1,"private":false,"sha":"3c84e04b38bceb5341419b9a4f8ca37ba790cb84","description":"\n\t\n\t\t\n\t\tNews\n\t\n\n\n[2025/02/18]: We add the original captions of PubMedVision in PubMedVision_Original_Caption.json, as well as the Chinese version of PubMedVision in PubMedVision_Chinese.json.\n[2024/07/01]: We add annotations for 'body_part' and 'modality' of images, utilizing the HuatuoGPT-Vision-7B model.\n\n\n\t\n\t\t\n\t\tPubMedVision\n\t\n\nPubMedVision is a large-scale medical VQA dataset. 
We extracted high-quality image-text pairs from PubMed and used GPT-4V to reformat them to enhance their quality.… See the full description on the dataset page: https://huggingface.co/datasets/FreedomIntelligence/PubMedVision.","downloads":552,"tags":["task_categories:question-answering","task_categories:text-generation","language:en","license:apache-2.0","size_categories:1M<n<10M","format:json","modality:image","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2406.19280","region:us","GPT-4V","Vision","medical","biology"],"createdAt":"2024-06-25T12:35:42.000Z","key":""},{"_id":"667be8b5a0e1bc14069c4974","id":"TucanoBR/GigaVerbo","author":"TucanoBR","disabled":false,"gated":false,"lastModified":"2025-07-24T08:07:05.000Z","likes":35,"trendingScore":1,"private":false,"sha":"f927ebe8aa40bbe027836907d9464c07190035f7","description":"\n\t\n\t\t\n\t\tGigaVerbo: a 780 GB Dataset of Portuguese Text\n\t\n\n\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nGigaVerbo is an extensive dataset comprising 780 GB of Portuguese text, being a concatenated version of several datasets available in Hugging Face, containing over 200 billion tokens. It encompasses various sources, including crawled websites, articles, translated conversations, and legal documents. 
This dataset offers a comprehensive and rich resource for various natural language processing tasks, providing… See the full description on the dataset page: https://huggingface.co/datasets/TucanoBR/GigaVerbo.","downloads":2109,"tags":["task_categories:text-generation","language:pt","license:other","size_categories:100M<n<1B","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2411.07854","doi:10.57967/hf/5835","region:us","portuguese","language-modeling"],"createdAt":"2024-06-26T10:08:53.000Z","key":""},{"_id":"667c9f5b492cd3560db9706f","id":"multimodalart/1920-raider-waite-tarot-public-domain","author":"multimodalart","disabled":false,"gated":false,"lastModified":"2024-08-14T13:35:16.000Z","likes":56,"trendingScore":1,"private":false,"sha":"8b5e156e58fdce2e22325b14976fc85c37676b01","downloads":138,"tags":["size_categories:n<1K","format:imagefolder","modality:image","modality:text","library:datasets","library:mlcroissant","region:us"],"createdAt":"2024-06-26T23:08:11.000Z","key":""},{"_id":"667d51d7db586ee20f15106b","id":"nothingiisreal/entirety_of_reddit","author":"nothingiisreal","disabled":false,"gated":false,"lastModified":"2024-06-27T11:53:49.000Z","likes":4,"trendingScore":1,"private":false,"sha":"82c5ea1bc738e86f8008298a222aee1a51ddda89","description":"I stumbled upon this 2.7TB large torrent that is basically a dump of entirety of reddit between 2005-2023.\nIt's an absolute treasure trove of data especially for LLM research but probably even for LDM research.\nAcademic Torrents Link\n","downloads":11,"tags":["task_categories:text-generation","region:us"],"createdAt":"2024-06-27T11:49:43.000Z","key":""},{"_id":"667f0a2e20ee9ac4177704ff","id":"robrenaud/multilingual_tinystories","author":"robrenaud","disabled":false,"gated":false,"lastModified":"2024-07-22T19:35:06.000Z","likes":2,"trendingScore":1,"private":false,"sha":"785db5c10f23ba13893cbe692943d656170dadb8","description":"An 
TinyStories dataset for Spanish.  The code to generate this is here.  https://github.com/rrenaud/multilingual_tinystories\n","downloads":62,"tags":["language:es","license:apache-2.0","region:us"],"createdAt":"2024-06-28T19:08:30.000Z","key":""},{"_id":"6681c0923f5ec7c82a63ee0b","id":"mlfoundations/dclm-baseline-1.0-parquet","author":"mlfoundations","disabled":false,"gated":false,"lastModified":"2024-07-19T17:35:58.000Z","likes":53,"trendingScore":1,"private":false,"sha":"817d6752765f6a41261085171dd546b104f60626","description":"\n\t\n\t\t\n\t\tDCLM-baseline\n\t\n\nNote: this is an identical copy of https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0, where all the files have been mapped to a parquet format.\nDCLM-baseline is a 4T token / 3B document pretraining dataset that achieves strong performance on language model benchmarks.\nBelow are comparisions of model trained on DCLM-baseline with other models in the 7B regime.\n\n\t\n\t\t\nModel\nParams\nTokens\nOpen dataset?\nCORE\nMMLU\nEXTENDED\n\n\n\t\t\nOpen weights, closed datasets… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0-parquet.","downloads":26900,"tags":["language:en","license:cc-by-4.0","size_categories:1B<n<10B","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2406.11794","region:us"],"createdAt":"2024-06-30T20:31:14.000Z","key":""},{"_id":"6681e25d2b6af3f60ac68675","id":"LouisChen15/ConstructionSite","author":"LouisChen15","disabled":false,"gated":"auto","lastModified":"2026-05-11T00:54:44.000Z","likes":53,"trendingScore":1,"private":false,"sha":"ca3d9b885b45cbec956817edc42253664c7faf3f","description":"\n\t\n\t\t\n\t\tDataset Card for ConstructionSite 10k\n\t\n\n\n\t\n\t\t\n\t\tDataset summary\n\t\n\nThe dataset consists of a total of 10,013 construction site images and their annotations. 
Among them, 7,009 images are assigned to the training split while 3,004 images are assigned to the test split.\nIf you use this dataset, we would appreciate you citing our work. See Citation information. We developed the dataset to test how well vision language models (VLMs) can understand construction site images and use their… See the full description on the dataset page: https://huggingface.co/datasets/LouisChen15/ConstructionSite.","downloads":483,"tags":["task_categories:image-to-text","task_categories:image-feature-extraction","language:en","license:cc-by-nc-4.0","size_categories:10K<n<100K","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","civil engineering","construction safety","computer vision","natural language processing","image captioning","visual question answering (VQA)","visual grounding"],"createdAt":"2024-06-30T22:55:25.000Z","key":""},{"_id":"66837949d8841593845cc241","id":"zxbsmk/NSFW-T2I","author":"zxbsmk","disabled":false,"gated":false,"lastModified":"2024-07-02T06:45:38.000Z","likes":19,"trendingScore":1,"private":false,"sha":"284d9ab74e67dbe4f7ec16ce8a7b152982a1cf81","description":"\n\t\n\t\t\n\t\tIntroduction (Version 1)\n\t\n\nAbout 38k image-text pairs(10k from LAION and 28k from nsfw_detect), and captions are generated by LLaVA-NeXT with prompt \"Describe the photo in detail (attributes of person)\".\nThe \"txt\" column shown in the dataset viewer is originated from LAION, not the captions yielded by LLaVA-NeXT.\n\n\t\n\t\t\n\t\n\t\n\t\tCaption Codes\n\t\n\npretrained = \"lmms-lab/llama3-llava-next-8b\"\nmodel_name = \"llava_llama3\"\ndevice = \"cuda:2\"\ndevice_map = \"auto\"\ntokenizer, model, image_processor… See the full description on the dataset page: 
https://huggingface.co/datasets/zxbsmk/NSFW-T2I.","downloads":193,"tags":["task_categories:image-classification","task_categories:image-to-text","task_categories:text-to-image","language:en","license:apache-2.0","size_categories:10K<n<100K","region:us"],"createdAt":"2024-07-02T03:51:37.000Z","key":""},{"_id":"6684b250986286e214df52b9","id":"walledai/HarmBench","author":"walledai","disabled":false,"gated":"auto","lastModified":"2024-07-31T21:46:08.000Z","likes":49,"trendingScore":1,"private":false,"sha":"fb6c2afd5a2a943d701d6db3efab87d077e81be5","description":"\n\t\n\t\t\n\t\tHarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal\n\t\n\nPaper: HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal\nData: Dataset\n\n\t\n\t\t\n\t\tAbout\n\t\n\nIn this dataset card, we only use the behavior prompts proposed in HarmBench.\n\n\t\n\t\t\n\t\tLicense\n\t\n\nMIT\n\n\t\n\t\t\n\t\tCitation\n\t\n\nIf you find HarmBench useful in your research, please consider citing the paper:\n@article{mazeika2024harmbench,\n  title={HarmBench: A… See the full description on the dataset page: https://huggingface.co/datasets/walledai/HarmBench.","downloads":7101,"tags":["language:en","license:mit","size_categories:n<1K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2402.04249","region:us"],"createdAt":"2024-07-03T02:07:12.000Z","key":""},{"_id":"6684d78500cfbd573beee44d","id":"walledai/XSTest","author":"walledai","disabled":false,"gated":"auto","lastModified":"2024-07-04T23:16:07.000Z","likes":25,"trendingScore":1,"private":false,"sha":"f1d713187c61b6ae64e602d74f0b3d812cc2e8e8","description":"\n\t\n\t\t\n\t\tXSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models\n\t\n\nPaper: XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models\nData: 
xstest_prompts_v2\n\n\t\n\t\t\n\t\tAbout\n\t\n\nWithout proper safeguards, large language models will follow malicious instructions and generate toxic content. This motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless.… See the full description on the dataset page: https://huggingface.co/datasets/walledai/XSTest.","downloads":5996,"tags":["language:en","license:cc-by-4.0","size_categories:n<1K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2308.01263","region:us"],"createdAt":"2024-07-03T04:45:57.000Z","key":""},{"_id":"668500e15ce2a730d284cb96","id":"Exploration-Lab/iSign","author":"Exploration-Lab","disabled":false,"gated":"auto","lastModified":"2024-08-31T12:28:52.000Z","likes":19,"trendingScore":1,"private":false,"sha":"e4ee6c5f0d9dfcbc74205e3f1388ce94da26c298","description":"\n\t\n\t\t\n\t\tiSign: A Benchmark for Indian Sign Language Processing\n\t\n\nThe iSign dataset serves as a benchmark for Indian Sign Language Processing. The dataset comprises of NLP-specific tasks (including SignVideo2Text, SignPose2Text, Text2Pose, Word Prediction, and Sign Semantics). 
The dataset is free for research use but not for commercial purposes.\n\n\t\n\t\t\n\t\tQuick Links\n\t\n\n\nWebsite: The landing page for iSign\narXiv Paper: Detailed information about the iSign Benchmark.\nDataset on Hugging Face:… See the full description on the dataset page: https://huggingface.co/datasets/Exploration-Lab/iSign.","downloads":219,"tags":["task_categories:translation","license:cc-by-nc-sa-4.0","size_categories:100K<n<1M","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2407.05404","region:us","indian sign language","machine translation","sign language translation"],"createdAt":"2024-07-03T07:42:25.000Z","key":""},{"_id":"668602d5bf8002d8e3f4aa91","id":"wenet-e2e/wenetspeech","author":"wenet-e2e","disabled":false,"gated":"auto","lastModified":"2024-07-04T10:20:36.000Z","likes":7,"trendingScore":1,"private":false,"sha":"5fe6045424583e9480cc6ca2a09df6932e3d51fd","description":"WenetSpeech is a 10000+ Hours Multi-domain Chinese Corpus for Speech Recognition. 
Please visit the official website, read the license, and follow the instruction to apply the data.","downloads":4518,"tags":["arxiv:2110.03370","region:us"],"createdAt":"2024-07-04T02:03:01.000Z","key":""},{"_id":"668688d71119f961e558f167","id":"wangyi111/Copernicus-Pretrain","author":"wangyi111","disabled":false,"gated":false,"lastModified":"2025-04-25T20:01:50.000Z","likes":7,"trendingScore":1,"private":false,"sha":"477d5dba02388539d491bc8d88175c51f0834534","description":"\n\t\n\t\t\n\t\tDataset Card for Copernicus-Pretrain\n\t\n\n\n\nCopernicus-Pretrain is a large-scale EO pretraining dataset with 18.7M aligned images covering all major Sentinel missions (S1,2,3,5P).\nOfficially named Copernicus-Pretrain, also referred to as SSL4EO-S (\"S\" means Sentinel), as an extension of SSL4EO-S12 to the whole Sentinel series.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Details\n\t\n\n\n\nCopernicus-Pretrain contains 18.7M aligned imagery from all major Sentinel missions in operation (Sentinel-1 SAR, Sentinel-2… See the full description on the dataset page: 
https://huggingface.co/datasets/wangyi111/Copernicus-Pretrain.","downloads":4538,"tags":["task_categories:image-classification","task_categories:image-feature-extraction","license:cc-by-4.0","size_categories:10M<n<100M","modality:geospatial","arxiv:2503.11849","region:us","earth-observation","remote-sensing","foundation-model","pretrain","self-supervised-learning","sentinel"],"createdAt":"2024-07-04T11:34:47.000Z","key":""},{"_id":"668811bdf49b0fea9a533690","id":"apoidea/pubtabnet-html","author":"apoidea","disabled":false,"gated":false,"lastModified":"2024-07-05T16:53:42.000Z","likes":23,"trendingScore":1,"private":false,"sha":"0d58324674ff93825c972f7fdfd2887e0ae0a247","downloads":1057,"tags":["task_categories:visual-question-answering","task_categories:image-to-text","task_categories:text-generation","license:cdla-permissive-1.0","size_categories:100K<n<1M","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-07-05T15:31:09.000Z","key":""},{"_id":"668a278fe56e5d94188cacb8","id":"isek-ai/danbooru-wiki-2024","author":"isek-ai","disabled":false,"gated":false,"lastModified":"2024-11-15T04:33:40.000Z","likes":28,"trendingScore":1,"private":false,"sha":"150ab9fe223322ce4e8ca2a1875c796ab7e0b371","description":"\n\t\n\t\t\n\t\tdanbooru-wiki-2024\n\t\n\n\n\t\n\t\t\n\t\tAbout\n\t\n\nWiki pages about the danbooru tags on danbooru.donmai.us. 
The wiki contains the description of each tag and matching to pixiv tags.\n\n\t\n\t\t\n\t\tUsage\n\t\n\nfrom datasets import load_dataset\n\nds = load_dataset(\n  \"isek-ai/danbooru-wiki-2024\",\n# revision=\"202408-at20240906\", # optional\n  split=\"train\",\n)\n\nThe revision name is as same as isek-ai/danbooru-tags-2024's.\n\n[!WARNING]\nNote:\nThis dataset would be irreguraly updated, if you want to use the same… See the full description on the dataset page: https://huggingface.co/datasets/isek-ai/danbooru-wiki-2024.","downloads":411,"tags":["task_categories:text-generation","task_categories:text-classification","language:en","language:ja","license:cc-by-sa-4.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-07-07T05:28:47.000Z","key":""},{"_id":"668b6496b424c825f79d9e13","id":"GreenNode/GreenNode-Table-Markdown-Retrieval-VN","author":"GreenNode","disabled":false,"gated":false,"lastModified":"2026-01-08T08:06:53.000Z","likes":5,"trendingScore":1,"private":false,"sha":"699949f7b3897494a3a5bbda993119bf38dd9c9e","description":"\n  GreenNodeTableMarkdownRetrieval\n  An MTEB dataset\n  Massive Text Embedding Benchmark\n\n\nGreenNodeTable documents\n\n\t\n\t\t\n\n\n\n\n\t\t\nTask category\nt2t\n\n\nDomains\nFinancial, Encyclopaedic, Non-fiction\n\n\nReference\nhttps://huggingface.co/GreenNode\n\n\n\t\n\nSource datasets:\n\nGreenNode/GreenNode-Table-Markdown-Retrieval-VN\n\n\n\t\n\t\t\n\t\tHow to evaluate on this task\n\t\n\nYou can evaluate an embedding model on this dataset using the following code:\nimport mteb\n\ntask = mteb.get_task(\"GreenNodeTableMarkdownRetrieval\")… See the full description on the dataset page: 
https://huggingface.co/datasets/GreenNode/GreenNode-Table-Markdown-Retrieval-VN.","downloads":120,"tags":["task_categories:text-retrieval","task_ids:document-retrieval","annotations_creators:human-annotated","multilinguality:monolingual","source_datasets:GreenNode/GreenNode-Table-Markdown-Retrieval-VN","language:vie","license:mit","size_categories:100K<n<1M","modality:text","arxiv:2502.13595","arxiv:2210.07316","region:us","mteb","text"],"createdAt":"2024-07-08T04:01:26.000Z","key":""},{"_id":"668b9b061a7b7c3e0f5df53d","id":"U4R/DocGenome","author":"U4R","disabled":false,"gated":false,"lastModified":"2024-12-18T01:28:28.000Z","likes":18,"trendingScore":1,"private":false,"sha":"17667838110ac54f322526006b4432c4e52f38ce","description":"\n\t\n\t\t\n\t\tDocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models\n\t\n\npaper link: DocGenome\nWe present DocGenome, a structured document dataset constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community, using our custom auto-labeling pipeline DocParser. 
DocGenome features four characteristics:\n\n\nCompleteness: It is the first dataset to structure data from all modalities including 13… See the full description on the dataset page: https://huggingface.co/datasets/U4R/DocGenome.","downloads":785,"tags":["task_categories:question-answering","task_categories:image-to-text","language:en","license:cc-by-4.0","size_categories:1K<n<10K","arxiv:2406.11633","region:us","chemistry","biology","finance","legal","medical","climate"],"createdAt":"2024-07-08T07:53:42.000Z","key":""},{"_id":"668bc695326f9104f903effc","id":"jamessyx/PathMMU","author":"jamessyx","disabled":false,"gated":"manual","lastModified":"2025-01-13T15:27:50.000Z","likes":22,"trendingScore":1,"private":false,"sha":"054e64e56e599e9636024f1471d49ecae4a2784f","description":"This is the official Hugging Face repo for PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology \n🌐 Homepage | 🤗 Dataset | 📖 arXiv | GitHub\n\n\t\n\t\t\n\t\n\t\n\t\t🔔News\n\t\n\n\nImportant Notes!!!!!! \nThe benchmark data and evaluation code have been released (8/7/2024)\n\n\n\t\n\t\t\n\t\n\t\n\t\tAbstract\n\t\n\nThe emergence of large multimodal models has unlocked remarkable potential in AI, particularly in pathology. 
However, the lack of specialized, high-quality benchmark… See the full description on the dataset page: https://huggingface.co/datasets/jamessyx/PathMMU.","downloads":329,"tags":["license:cc-by-nd-4.0","arxiv:2401.16355","arxiv:2306.11207","region:us"],"createdAt":"2024-07-08T10:59:33.000Z","key":""},{"_id":"668d53dc0a6a758a8db9ce01","id":"adiren7/darija_speech_to_text","author":"adiren7","disabled":false,"gated":false,"lastModified":"2024-08-11T15:34:37.000Z","likes":9,"trendingScore":1,"private":false,"sha":"299da4cf619fe28db90b159b86e9258271971d45","downloads":138,"tags":["task_categories:automatic-speech-recognition","license:apache-2.0","size_categories:10K<n<100K","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-07-09T15:14:36.000Z","key":""},{"_id":"668e601c1173ab43d9d149d4","id":"TencentARC/StoryStream","author":"TencentARC","disabled":false,"gated":false,"lastModified":"2024-07-17T06:12:17.000Z","likes":30,"trendingScore":1,"private":false,"sha":"d48171f45ba5fa24a82663dc4c064594fc9a49bc","description":"\n\t\n\t\t\n\t\tStoryStream Dataset\n\t\n\n\n\n\n\t\n\t\t\n\t\tIntroduction\n\t\n\nThe StoryStream dataset is an innovative resource aimed at advancing multimodal story generation. Originating from popular cartoon series, this dataset includes a comprehensive collection of detailed narratives and high resolution images. It is designed to support the creation of long story sequences. \n\nFigure: Data samples from our StoryStream dataset alongside existing multimodal story generation datasets. 
Our dataset features visually… See the full description on the dataset page: https://huggingface.co/datasets/TencentARC/StoryStream.","downloads":129,"tags":["language:en","license:apache-2.0","size_categories:100K<n<1M","modality:text","arxiv:2407.08683","region:us"],"createdAt":"2024-07-10T10:19:08.000Z","key":""},{"_id":"668f7fb278abda9a4ed45d53","id":"CanCLID/zoengjyutgaai","author":"CanCLID","disabled":false,"gated":false,"lastModified":"2026-02-05T04:50:46.000Z","likes":29,"trendingScore":1,"private":false,"sha":"829b30725811cabe9b8e2fd0513966470b304f12","description":"\n\t\n\t\t\n\t\t張悦楷講古語音數據集\n\t\n\nEnglish\n呢個係張悦楷講《三國演義》、《水滸傳》、《走進毛澤東的最後歲月》、《鹿鼎記》語音數據集。張悦楷係廣州最出名嘅講古佬 / 粵語説書藝人。佢從上世紀七十年代開始就喺廣東各個收音電台度講古，佢把聲係好多廣州人嘅共同回憶。本數據集收集嘅係佢最知名嘅四部作品。\n數據集用途：\n\nTTS（語音合成）訓練集\nASR（語音識別）訓練集或測試集\n各種語言學、文學研究\n直接聽嚟欣賞藝術！\n\nTTS 效果演示：https://huggingface.co/spaces/laubonghaudoi/zoengjyutgaai_tts\n\n\t\n\t\t\n\t\n\t\n\t\t説明\n\t\n\n\n所有文本都根據 https://jyutping.org/blog/typo/ 同 https://jyutping.org/blog/particles/ 規範用字。\n所有文本都使用全角標點，冇半角標點。\n所有文本都用漢字轉寫，無阿拉伯數字無英文字母\n所有音頻源都存放喺/source，為方便直接用作訓練數據，切分後嘅音頻都放喺 opus/\n所有 opus 音頻皆為 48000… See the full description on the dataset page: 
https://huggingface.co/datasets/CanCLID/zoengjyutgaai.","downloads":5376,"tags":["task_categories:automatic-speech-recognition","task_categories:text-to-speech","task_categories:text-generation","task_categories:feature-extraction","task_categories:audio-to-audio","task_categories:audio-classification","task_categories:text-to-audio","language:yue","license:cc0-1.0","size_categories:100K<n<1M","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us","cantonese","audio","art"],"createdAt":"2024-07-11T06:46:10.000Z","key":""},{"_id":"66938b821b082600cf1364d9","id":"Gryphe/Sonnet3.5-SlimOrcaDedupCleaned","author":"Gryphe","disabled":false,"gated":false,"lastModified":"2024-10-04T08:41:30.000Z","likes":96,"trendingScore":1,"private":false,"sha":"d1ada626c2c4058dee6e0ca8a440d69f7bc5c843","description":"2024-10-04: I fixed two issues that were affecting newlines (all double newlines were gone!) and cleaned up spaces preceding closing quotes. Many thanks to PocketDoc for bringing this to my attention!\nA Sonnet 3.5 generated version of Caitlyn's wonderfully cleaned SlimOrca Deduped dataset, ready for training using the ShareGPT format. \nAs always, an effort was made to ensure no censoring was applied to the responses. 
If you find any refusals, let me know!\n","downloads":182,"tags":["license:mit","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-07-14T08:25:38.000Z","key":""},{"_id":"66939a72492733611b48206c","id":"hongxiaoy/OccScanNet","author":"hongxiaoy","disabled":false,"gated":false,"lastModified":"2024-10-12T05:49:08.000Z","likes":7,"trendingScore":1,"private":false,"sha":"5feda6de847a620389c8d635a5b1863aa5f3bcd3","description":"\n\t\n\t\t\n\t\tPreparing ISO\n\t\n\n\n\t\n\t\t\n\t\tDatasets\n\t\n\nWe provide the OccScanNet dataset files here, but you should agree the term of use of ScanNet, CompleteScanNet dataset.\nFor a simplified way to  prepare the dataset, you just download the preprocessed_data to ISO/data/occscannet as gathered_data and download the posed_images to ISO/data/scannet.\nHowever, the complete dataset generating process is provided as followed:\n\n\t\n\t\t\n\t\n\t\n\t\tOccScanNet\n\t\n\n\nClone the official MMDetection3D repository.\n\ngit clone… See the full description on the dataset page: 
https://huggingface.co/datasets/hongxiaoy/OccScanNet.","downloads":540,"tags":["license:apache-2.0","size_categories:10K<n<100K","format:text","modality:text","library:datasets","library:mlcroissant","region:us"],"createdAt":"2024-07-14T09:29:22.000Z","key":""},{"_id":"669500f3a51ccecc904c38f7","id":"FinanceMTEB/DISCFinLLM-Computing","author":"FinanceMTEB","disabled":false,"gated":false,"lastModified":"2024-07-15T10:59:10.000Z","likes":1,"trendingScore":1,"private":false,"sha":"2342751577b08c5ee989174fdac8f08d6d7f3e88","downloads":30,"tags":["size_categories:n<1K","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-07-15T10:58:59.000Z","key":""},{"_id":"6695ed590818007c1c61de00","id":"RUC-NLPIR/FlashRAG_datasets","author":"RUC-NLPIR","disabled":false,"gated":false,"lastModified":"2025-05-06T04:59:56.000Z","likes":94,"trendingScore":1,"private":false,"sha":"bcafb8dd07d453be3cbeeeb3f78be1841bddf92c","description":"\n\t\n\t\t\n\t\t⚡FlashRAG: A Python Toolkit for Efficient RAG Research\n\t\n\nFlashRAG is a Python toolkit for the reproduction and development of Retrieval Augmented Generation (RAG) research. Our toolkit includes 36 pre-processed benchmark RAG datasets and 16 state-of-the-art RAG algorithms. 
\nWith FlashRAG and provided resources, you can effortlessly reproduce existing SOTA works in the RAG domain or implement your custom RAG processes and components.\nFor more information, please view our GitHub repo… See the full description on the dataset page: https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets.","downloads":13488,"tags":["task_categories:question-answering","task_categories:summarization","language:en","license:cc-by-sa-4.0","size_categories:1M<n<10M","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2405.13576","region:us"],"createdAt":"2024-07-16T03:47:37.000Z","key":""},{"_id":"669720f267c22a79a181d8de","id":"taesiri/PhotoshopRequests","author":"taesiri","disabled":false,"gated":"manual","lastModified":"2024-07-17T22:55:38.000Z","likes":2,"trendingScore":1,"private":false,"sha":"625f8cda14ff20e48366daa0ec692484e2797605","downloads":7,"tags":["task_categories:image-to-text","language:en","license:mit","size_categories:10K<n<100K","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","photoshop"],"createdAt":"2024-07-17T01:40:02.000Z","key":""},{"_id":"66993fa7004e418e7113e4b0","id":"BangumiBase/arknightstouinkiro","author":"BangumiBase","disabled":false,"gated":false,"lastModified":"2024-07-18T17:56:27.000Z","likes":1,"trendingScore":1,"private":false,"sha":"c6763f7587b7a3292be0d5169cbd62c436c209e2","description":"\n\t\n\t\t\n\t\tBangumi Image Base of Arknights: Touin Kiro\n\t\n\nThis is the image base of bangumi Arknights: Touin Kiro, we detected 32 characters, 1773 images in total. The full dataset is here.\nPlease note that these image bases are not guaranteed to be 100% cleaned, they may be noisy actual. 
If you intend to manually train models using this dataset, we recommend performing necessary preprocessing on the downloaded dataset to eliminate potential noisy samples (approximately 1% probability).\nHere is… See the full description on the dataset page: https://huggingface.co/datasets/BangumiBase/arknightstouinkiro.","downloads":1536,"tags":["license:mit","size_categories:1K<n<10K","modality:image","modality:text","region:us","art"],"createdAt":"2024-07-18T16:15:35.000Z","key":""},{"_id":"669a4c1707761aa1751fdc1e","id":"BangumiBase/vinlandsaga","author":"BangumiBase","disabled":false,"gated":false,"lastModified":"2024-07-19T18:31:20.000Z","likes":1,"trendingScore":1,"private":false,"sha":"23ced4bba01a2ae65dc6070274efe20f1f6b3fc4","description":"\n\t\n\t\t\n\t\tBangumi Image Base of Vinland Saga\n\t\n\nThis is the image base of bangumi Vinland Saga, we detected 83 characters, 7812 images in total. The full dataset is here.\nPlease note that these image bases are not guaranteed to be 100% cleaned, they may be noisy actual. If you intend to manually train models using this dataset, we recommend performing necessary preprocessing on the downloaded dataset to eliminate potential noisy samples (approximately 1% probability).\nHere is the characters'… See the full description on the dataset page: https://huggingface.co/datasets/BangumiBase/vinlandsaga.","downloads":1759,"tags":["license:mit","size_categories:1K<n<10K","modality:image","region:us","art"],"createdAt":"2024-07-19T11:20:55.000Z","key":""},{"_id":"669ca4d6a8b62d05153d363b","id":"QingyuLiu1/ICSD","author":"QingyuLiu1","disabled":false,"gated":"manual","lastModified":"2025-06-11T02:20:05.000Z","likes":22,"trendingScore":1,"private":false,"sha":"ebb20e7d5598681e1feb53e4ec35123174a96bf9","description":"\n\t\n\t\t\n\t\tICSD: An Open-source Dataset for Infant Cry and Snoring Detection\n\t\n\n\n\nThe ICSD dataset is a publicly available resource for the detection of infant cries and snoring sounds. 
It contains over 4 hours of audio data related to snoring sounds and infant crying sounds, as well as their corresponding annotations.\n\n\t\n\t\t\n\t\tDataset Overview\n\t\n\n\nPlease note that our paper is currently under review. If you're interested in utilizing the dataset, please submit the necessary information on the top… See the full description on the dataset page: https://huggingface.co/datasets/QingyuLiu1/ICSD.","downloads":87,"tags":["license:cc-by-nc-sa-4.0","size_categories:100K<n<1M","modality:audio","region:us","audio","sound event detection"],"createdAt":"2024-07-21T06:04:06.000Z","key":""},{"_id":"669d6a5cc6e9e12d78866b11","id":"SilentAntagonist/vintage-photography-450k-high-quality-captions","author":"SilentAntagonist","disabled":false,"gated":false,"lastModified":"2024-07-21T22:46:01.000Z","likes":35,"trendingScore":1,"private":false,"sha":"7df43c74d1c87d4e351d0ad5f347d6c06486e611","description":"This is a 450k image datastet focused on photography from the 20th century, and their analog aspect. Many of the images are in high resolution. 
This dataset currently has 20k images captioned with InternVL2 26B, and is a work in progress (I plan to caption the entire dataset and also have short captions for all of the images, compute is an issue for now).\n","downloads":42,"tags":["task_categories:image-classification","task_categories:image-feature-extraction","task_categories:text-to-image","language:en","license:cc-by-sa-4.0","size_categories:100K<n<1M","format:parquet","modality:image","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-07-21T20:06:52.000Z","key":""},{"_id":"669ebeda19845ee0549a5977","id":"project-themis/git-commits","author":"project-themis","disabled":false,"gated":false,"lastModified":"2026-05-04T02:32:47.000Z","likes":1,"trendingScore":1,"private":false,"sha":"82aaa3e13b1111be511c88947bfbd768794d4e6e","description":"\n\n\n\t\n\t\t\n\t\tThemis-Git-Commits\n\t\n\n\n\n\n\n\n\t\n\t\t\n\t\tOverview\n\t\n\nThemis-Git-Commits is a large-scale dataset of single-file code commits mined from permissively licensed GitHub repositories via the BigQuery GitHub public dataset. The SQL query restricts to repositories under permissive open-source licenses only (MIT, Apache-2.0, BSD-2/3-Clause, ISC, CC0-1.0, EPL-1.0, MPL-2.0, Unlicense, AGPL-3.0, LGPL-2.1, Artistic-2.0). 
The BigQuery snapshot used contains commits up to early 2022 — predating the… See the full description on the dataset page: https://huggingface.co/datasets/project-themis/git-commits.","downloads":2016,"tags":["task_categories:text-generation","language:code","license:apache-2.0","size_categories:10M<n<100M","format:arrow","modality:text","library:datasets","library:mlcroissant","arxiv:2605.00754","arxiv:2308.07124","region:us","code","github","commits","multilingual"],"createdAt":"2024-07-22T20:19:38.000Z","key":""},{"_id":"66a27dd5927b7012d369f343","id":"wufeim/imagenet1k_captions_minigpt4","author":"wufeim","disabled":false,"gated":false,"lastModified":"2024-07-25T16:36:22.000Z","likes":1,"trendingScore":1,"private":false,"sha":"2a3f59c38d6630d0955e1d710d6c20b36696dbfb","description":"\n\t\n\t\t\n\t\tImageNet1k Captions Generated with MiniGPT-4\n\t\n\nMiniGPT-4 captions generated for ImageNet1k images. Can be used for training/finetuning diffusion models for image generation.\n\nImageNet1k: link\nMiniGPT-4: link\n\n","downloads":9,"tags":["license:cc-by-nc-4.0","size_categories:1M<n<10M","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-07-25T16:31:17.000Z","key":""},{"_id":"66a53dc7d40a13036c5f2ebe","id":"mlabonne/FineTome-100k","author":"mlabonne","disabled":false,"gated":false,"lastModified":"2024-07-29T09:52:30.000Z","likes":275,"trendingScore":1,"private":false,"sha":"c2343c1372ff31f51aa21248db18bffa3193efdb","description":"\n\t\n\t\t\n\t\tFineTome-100k\n\t\n\n\nThe FineTome dataset is a subset of arcee-ai/The-Tome (without arcee-ai/qwen2-72b-magpie-en), re-filtered using HuggingFaceFW/fineweb-edu-classifier.\nIt was made for my article \"Fine-tune Llama 3.1 Ultra-Efficiently with 
Unsloth\".\n","downloads":9182,"tags":["size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-07-27T18:34:47.000Z","key":""},{"_id":"66a775df7610d8b673cf6aa3","id":"google/granola-entity-questions","author":"google","disabled":false,"gated":false,"lastModified":"2024-08-01T06:13:17.000Z","likes":12,"trendingScore":1,"private":false,"sha":"695e29a15ec509f9e167a920206dad716af5c98d","description":"\n\t\n\t\t\n\t\tGRANOLA Entity Questions Dataset Card\n\t\n\n\n\t\n\t\t\n\t\tDataset details\n\t\n\nDataset Name: GRANOLA-EQ (Granularity of Labels Entity Questions)\nPaper: Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers\nAbstract: Factual questions typically can be answered correctly at different levels of granularity. For example, both \"August 4, 1961\" and \"1961\" are correct answers to the question \"When was Barack Obama born?\"\". 
Standard question answering (QA)… See the full description on the dataset page: https://huggingface.co/datasets/google/granola-entity-questions.","downloads":179,"tags":["task_categories:question-answering","language:en","license:cc-by-nd-4.0","size_categories:10K<n<100K","format:csv","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2401.04695","region:us"],"createdAt":"2024-07-29T10:58:39.000Z","key":""},{"_id":"66a7d02c95054e8a3183f310","id":"polyglots/Sinhala-NER","author":"polyglots","disabled":false,"gated":false,"lastModified":"2024-07-29T17:24:00.000Z","likes":1,"trendingScore":1,"private":false,"sha":"3532982ea2d42c52d07d91b73559279712c682b8","downloads":39,"tags":["size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-07-29T17:23:56.000Z","key":""},{"_id":"66a9fba22c7c3ebdd7de3d33","id":"applied-ai-018/pretraining_v1-omega_books","author":"applied-ai-018","disabled":false,"gated":false,"lastModified":"2024-08-05T19:01:31.000Z","likes":7,"trendingScore":1,"private":false,"sha":"54b95f40c52b3d03bb07e672ab05a0e677d3e0a7","downloads":374146,"tags":["size_categories:100M<n<1B","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-07-31T08:53:54.000Z","key":""},{"_id":"66aa41931b978e02d49aaaa4","id":"alfredplpl/artbench-pd-256x256","author":"alfredplpl","disabled":false,"gated":false,"lastModified":"2024-09-02T06:46:15.000Z","likes":9,"trendingScore":1,"private":false,"sha":"4056b70025f0e3f916fb271c7a792633ba637484","description":"\n\t\n\t\t\n\t\tDataset Card for ArtBench Public Domain 256x256\n\t\n\n\n日本語はこちら\nThis repository is the subset of ArtBench.\nArtBench is the dataset for historical arts such as Art Nouveau and Ukiyo-e.\nI picked up public domain images from ArtBench. 
Then, I create new dataset.\n\n\t\n\t\t\n\t\n\t\n\t\tUsage\n\t\n\nYou can use huggingface datasets to download the dataset.\nYou can also download the tar file.\nfrom datasets import load_dataset\n\ndataset = load_dataset(\"alfredplpl/artbench-pd-256x256\")\n\n\n\t\n\t\t\n\t\tIntended Use… See the full description on the dataset page: https://huggingface.co/datasets/alfredplpl/artbench-pd-256x256.","downloads":184,"tags":["task_categories:text-to-image","task_categories:image-to-text","language:en","license:mit","size_categories:10K<n<100K","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2206.11404","region:us","art"],"createdAt":"2024-07-31T13:52:19.000Z","key":""},{"_id":"66ac9aed3db7c9104bee25e4","id":"lehduong/seaart-hq","author":"lehduong","disabled":false,"gated":false,"lastModified":"2024-08-02T08:38:56.000Z","likes":3,"trendingScore":1,"private":false,"sha":"4321a586faf084ddfa6ab415caae4073af13aa67","downloads":17,"tags":["size_categories:10K<n<100K","format:parquet","modality:image","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-08-02T08:38:05.000Z","key":""},{"_id":"66af599fa20def3de3307257","id":"riotu-lab/ArabicQA_2.1M","author":"riotu-lab","disabled":false,"gated":false,"lastModified":"2024-08-04T11:46:12.000Z","likes":6,"trendingScore":1,"private":false,"sha":"6cb82b0bbf5ff1ceb14bcbac113b9d97bc154652","description":"\n\t\n\t\t\n\t\tArabic Question Answering Dataset\n\t\n\n\n\t\n\t\t\n\t\tDescription\n\t\n\n\n\t\n\t\t\n\t\tDataset Overview\n\t\n\nOur dataset is an amalgamation of several filtered datasets, the total number of rows for all datasets was 4,731,600 which was reduced to 2,141,146 rows after filtering. 
The dataset was collected to fine a pretraind model, the model forced a number of contrains on us discussed in the following section.\n\n\t\n\t\t\n\t\tFiltering Process\n\t\n\nThe filtering process for each dataset included one or more of the… See the full description on the dataset page: https://huggingface.co/datasets/riotu-lab/ArabicQA_2.1M.","downloads":48,"tags":["task_categories:question-answering","language:ar","license:apache-2.0","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-08-04T10:36:15.000Z","key":""},{"_id":"66b3c59b45cd1583e1b3ae4b","id":"IntelLabs/Intel_Robotic_Welding_Multimodal_Dataset","author":"IntelLabs","disabled":false,"gated":"manual","lastModified":"2025-09-23T10:56:58.000Z","likes":38,"trendingScore":1,"private":false,"sha":"ea6330626a0c60e3d3dd2d79d165834b6519eba4","description":"\n\t\n\t\t\n\t\tDataset Card for the Intel Robotic Welding Multimodal Dataset\n\t\n\n\n\nThis dataset was collected to enable multimodal welding defect detection research. The dataset contains over 4000 annotated samples and was collected in an automotive production floor setting in collaboration with a supplier with access to such facilities. Each sample contains a video, associated audio, a time-series from welding sensors, and five post-weld images for a particular weld. 
A separately licensed… See the full description on the dataset page: https://huggingface.co/datasets/IntelLabs/Intel_Robotic_Welding_Multimodal_Dataset.","downloads":69,"tags":["license:other","modality:audio","modality:video","modality:timeseries","modality:image","arxiv:2409.02290","region:us","audio","video","timeseries","image","robotics","welding","defect detection","anomaly detection","defect classification","industry 4.0"],"createdAt":"2024-08-07T19:06:03.000Z","key":""},{"_id":"66b643d56c8d3b367854c523","id":"MERA-evaluation/MERA","author":"MERA-evaluation","disabled":false,"gated":false,"lastModified":"2024-09-24T12:55:46.000Z","likes":10,"trendingScore":1,"private":false,"sha":"73cf223e40f1ef51ba74a048b584874a35c3d88b","description":"\n\t\n\t\t\n\t\tMERA (Multimodal Evaluation for Russian-language Architectures)\n\t\n\n\n\t\n\t\t\n\t\tSummary\n\t\n\nMERA (Multimodal Evaluation for Russian-language Architectures) is a new open independent benchmark for the evaluation of SOTA models for the Russian language.\nThe MERA benchmark unites industry and academic partners in one place to research the capabilities of fundamental models, draw attention to AI-related issues, foster collaboration within the Russian Federation and in the international arena… See the full description on the dataset page: 
https://huggingface.co/datasets/MERA-evaluation/MERA.","downloads":2967,"tags":["language:ru","license:mit","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-08-09T16:29:09.000Z","key":""},{"_id":"66b6677a4dd3513d6328ca15","id":"AbstractTTS/IEMOCAP","author":"AbstractTTS","disabled":false,"gated":false,"lastModified":"2024-08-11T05:46:04.000Z","likes":28,"trendingScore":1,"private":false,"sha":"9f1696a135a65ce997d898d4121c952269a822ca","downloads":1304,"tags":["size_categories:10K<n<100K","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-08-09T19:01:14.000Z","key":""},{"_id":"66b849d8088299999a5faf0b","id":"rohitsaxena/MovieSum","author":"rohitsaxena","disabled":false,"gated":false,"lastModified":"2024-08-14T04:40:05.000Z","likes":24,"trendingScore":1,"private":false,"sha":"cd2fec677719fa3037f1f1513a14186b72a51c4e","description":"\n\t\n\t\t\n\t\tMovieSum: An Abstractive Summarization Dataset for Movie Screenplays\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nMovieSum consists of 2,200 movie screenplays and their corresponding Wikipedia summaries. It is a long-form summarization task where the mean length of movie screenplays is approximately 34K. We manually formatted the movie screenplays to represent their structural elements. 
We also provide the IMDB ID for each movie to facilitate the collection of additional metadata.\n\n\t\n\t\t\n\t\tDataset… See the full description on the dataset page: https://huggingface.co/datasets/rohitsaxena/MovieSum.","downloads":323,"tags":["size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2408.06281","region:us"],"createdAt":"2024-08-11T05:19:20.000Z","key":""},{"_id":"66b9c17c0a5131d78fd8d500","id":"AbstractTTS/PODCAST","author":"AbstractTTS","disabled":false,"gated":false,"lastModified":"2024-08-12T14:30:24.000Z","likes":17,"trendingScore":1,"private":false,"sha":"e791939e368b6c26595ca7f2c75aa29ebd92e776","downloads":1073,"tags":["size_categories:100K<n<1M","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-08-12T08:02:04.000Z","key":""},{"_id":"66baa0a65b3a757a7401ceba","id":"HuggingFaceTB/everyday-conversations-llama3.1-2k","author":"HuggingFaceTB","disabled":false,"gated":false,"lastModified":"2025-01-29T23:16:26.000Z","likes":133,"trendingScore":1,"private":false,"sha":"14f543216b9ba42b6b951dc5bd199460d193b162","description":"\n\t\n\t\t\n\t\tEveryday conversations for Smol LLMs finetunings\n\t\n\nThis dataset contains 2.2k multi-turn conversations generated by Llama-3.1-70B-Instruct. We ask the LLM to generate a simple multi-turn conversation, with 3-4 short exchanges, between a User and an AI Assistant about a certain topic.\nThe topics are chosen to be simple to understand by smol LLMs and cover everyday topics + elementary science. 
We include:\n\n20 everyday topics with 100 subtopics each\n43 elementary science topics with 10… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k.","downloads":1959,"tags":["language:en","license:apache-2.0","size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-08-12T23:54:14.000Z","key":""},{"_id":"66bc06dc6da7aec8413d35ba","id":"NousResearch/hermes-function-calling-v1","author":"NousResearch","disabled":false,"gated":false,"lastModified":"2026-01-03T13:32:47.000Z","likes":423,"trendingScore":1,"private":false,"sha":"dae3e1d28cfbcf4b915c04ea1e072030529b4bda","description":"\n\n\t\n\t\t\n\t\tHermes Function-Calling V1\n\t\n\nThis dataset is the compilation of structured output and function calling data used in the Hermes 2 Pro series of models.\nThis repository contains a structured output dataset with function-calling conversations, json-mode, agentic json-mode and structured extraction samples, designed to train LLM models in performing function calls and returning structured output based on natural language instructions. 
The dataset features various conversational scenarios… See the full description on the dataset page: https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1.","downloads":24289,"tags":["task_categories:text-generation","task_categories:question-answering","task_categories:feature-extraction","language:en","license:apache-2.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","tool-use,","function-calling","agentic","synthetic-data"],"createdAt":"2024-08-14T01:22:36.000Z","key":""},{"_id":"66bce2e10dc132aeea62ab23","id":"microsoft/mocapact-data","author":"microsoft","disabled":false,"gated":false,"lastModified":"2024-08-17T04:58:49.000Z","likes":6,"trendingScore":1,"private":false,"sha":"5e928d84245746a7b89b40605fb5bada8402bc5c","description":"\n\t\n\t\t\n\t\tMoCapAct Dataset\n\t\n\nControl of simulated humanoid characters is a challenging benchmark for sequential decision-making methods, as it assesses a policy’s ability to drive an inherently unstable, discontinuous, and high-dimensional physical system. Motion capture (MoCap) data can be very helpful in learning sophisticated locomotion policies by teaching a humanoid agent low-level skills (e.g., standing, walking, and running) that can then be used to generate high-level behaviors. 
However… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/mocapact-data.","downloads":108,"tags":["license:cdla-permissive-2.0","region:us"],"createdAt":"2024-08-14T17:01:21.000Z","key":""},{"_id":"66bd9943a03b764ca98f0c11","id":"MahtaFetrat/Mana-TTS","author":"MahtaFetrat","disabled":false,"gated":false,"lastModified":"2025-07-12T12:32:59.000Z","likes":24,"trendingScore":1,"private":false,"sha":"698f7f7bfe7a8fc665e9c4c0ca9eea7e0d49f64f","description":"\n\t\n\t\t\n\t\tManaTTS-Persian-Speech-Dataset\n\t\n\nManaTTS is the largest publicly available single-speaker Persian corpus, comprising over 114 hours of high-quality audio (sampled at 44.1 kHz). Released under the permissive CC-0 license, this dataset is freely usable for both educational and commercial purposes.  \nCollected from Nasl-e-Mana magazine, the dataset covers a diverse range of topics, making it ideal for training robust text-to-speech (TTS) models. The release includes a fully transparent… See the full description on the dataset page: https://huggingface.co/datasets/MahtaFetrat/Mana-TTS.","downloads":3952,"tags":["language:fa","license:cc0-1.0","size_categories:10K<n<100K","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","doi:10.57967/hf/2888","region:us","text-to-speech","tts","speech-synthesis","persian","data-collection","data-preprocessing","speech-processing","forced-alignment","speech-dataset","speech-corpus","dataset-preparation","persian-speech","tts-dataset","text-to-speech-dataset","mana-tts","manatts","speech-data-collection"],"createdAt":"2024-08-15T05:59:31.000Z","key":""},{"_id":"66c1dd7d70eace5a999d1bf3","id":"starhopp3r/TinyChat","author":"starhopp3r","disabled":false,"gated":false,"lastModified":"2025-10-02T13:17:36.000Z","likes":21,"trendingScore":1,"private":false,"sha":"1a19729d536023d40072d491d4691e4c71461e92","description":"\n\t\n\t\t\n\t\tSynthetic Short Chat 
Conversations Dataset using BASIC English\n\t\n\n\n\t\n\t\t\n\t\tOverview\n\t\n\nThis dataset comprises 1,000,000 synthetically generated short chat conversations, created using a specialized version of GPT-4o (referred to as GPT-4o mini). The conversations are primarily constructed using BASIC (British Academic Scientific International Commercial) English words and grammar. However, to ensure the coherence and fluidity of the dialogues, some non-BASIC English words have been… See the full description on the dataset page: https://huggingface.co/datasets/starhopp3r/TinyChat.","downloads":221,"tags":["language:en","license:cc-by-nc-4.0","size_categories:1M<n<10M","format:text","modality:text","library:datasets","library:mlcroissant","arxiv:2305.07759","region:us"],"createdAt":"2024-08-18T11:39:41.000Z","key":""},{"_id":"66c2279c8e95eabff210fcf8","id":"aslawliet/cn-k12","author":"aslawliet","disabled":false,"gated":false,"lastModified":"2024-08-18T16:59:00.000Z","likes":13,"trendingScore":1,"private":false,"sha":"31abc091ad6f8240673c00b2e9b2404da2e0bfcc","downloads":77,"tags":["license:cc-by-nc-2.0","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-08-18T16:55:56.000Z","key":""},{"_id":"66c49664d49ec9cf2d39eb2a","id":"TsinghuaC3I/UltraMedical-Preference","author":"TsinghuaC3I","disabled":false,"gated":false,"lastModified":"2024-08-20T13:16:46.000Z","likes":11,"trendingScore":1,"private":false,"sha":"761eb7935310ba662a96d93c5af342e5269d5759","description":"\n\t\n\t\t\n\t\tDataset Card for Dataset Name\n\t\n\n\n\t\n\t\t\n\t\tThe UltraMedical-Preference Dataset\n\t\n\nThe UltraMedical-Preference dataset enhances the UltraMedical collection with preference annotations. 
This dataset includes responses sampled from both open-source and proprietary models, annotated for user preferences.\n\n\t\n\t\t\n\t\tModels Used for Annotation\n\t\n\nFor the proprietary models, we utilized gpt-3.5-turbo and gpt-4-turbo. For open-source models, selections included Llama-3-8B/70B, Qwen1.5-72B… See the full description on the dataset page: https://huggingface.co/datasets/TsinghuaC3I/UltraMedical-Preference.","downloads":237,"tags":["language:en","license:mit","size_categories:100K<n<1M","arxiv:2406.03949","region:us"],"createdAt":"2024-08-20T13:13:08.000Z","key":""},{"_id":"66c582fe30010c0f2bba4176","id":"Team-ACE/ToolACE","author":"Team-ACE","disabled":false,"gated":false,"lastModified":"2024-09-04T02:37:59.000Z","likes":185,"trendingScore":1,"private":false,"sha":"6bda777c88d21e5a204703c1ee45597a8fa4f734","description":"\n\t\n\t\t\n\t\tToolACE\n\t\n\nToolACE is an automatic agentic pipeline designed to generate Accurate, Complex, and divErse tool-learning data. \nToolACE leverages a novel self-evolution synthesis process to curate a comprehensive API pool of 26,507 diverse APIs. \nDialogs are further generated through the interplay among multiple agents, guided by a formalized thinking process. \nTo ensure data accuracy, we implement a dual-layer verification system combining rule-based and model-based checks. 
\nMore details… See the full description on the dataset page: https://huggingface.co/datasets/Team-ACE/ToolACE.","downloads":7906,"tags":["task_categories:text-generation","language:en","language:zh","license:apache-2.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2409.00920","region:us","synthetic","tools"],"createdAt":"2024-08-21T06:02:38.000Z","key":""},{"_id":"66c5cde04931633cd69cb814","id":"ChuGyouk/AI_healthcare_QA","author":"ChuGyouk","disabled":false,"gated":"manual","lastModified":"2024-08-21T16:26:02.000Z","likes":6,"trendingScore":1,"private":false,"sha":"2a0ef7b2ef4fefff8676335e8507d93589875c56","description":"\n\t\n\t\t\n\t\tDataset Description\n\t\n\nFrom Super-large AI healthcare Q&A data [AIHUB], \nI sampled a subset of questions from the training data, and then obtained responses using gpt-4o-2024-08-06 and gpt-4o-mini-2024-07-18.\n\n\t\n\t\t\n\t\tCAUTION\n\t\n\nThe provided data was generated by GPT and should not be considered as professional medical advice, diagnosis, or treatment. 
For accurate diagnosis and treatment of any specific medical issue, please consult a qualified physician or healthcare professional.… See the full description on the dataset page: https://huggingface.co/datasets/ChuGyouk/AI_healthcare_QA.","downloads":8,"tags":["task_categories:question-answering","language:ko","license:cc-by-nc-sa-4.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","medical"],"createdAt":"2024-08-21T11:22:08.000Z","key":""},{"_id":"66cc3d7e0aaaa6753b5418f9","id":"lightonai/ms-marco-en-bge","author":"lightonai","disabled":false,"gated":false,"lastModified":"2025-09-11T12:01:12.000Z","likes":7,"trendingScore":1,"private":false,"sha":"ad2472943e41db62b7883838ad7b2dbc7d572b21","description":"\n\t\n\t\t\n\t\tms-marco-en-bge\n\t\n\nThis dataset contains the MS MARCO dataset with documents similar to the query mined using BGE-M3 and then scored by bge-reranker-v2-m3. \nIt can be used to train a retrieval model using knowledge distillation, for example using PyLate.\n\n\t\n\t\t\n\t\n\t\n\t\tknowledge distillation\n\t\n\nTo fine-tune a model using knowledge distillation loss we will need three distinct file:\n\nDatasetsfrom datasets import load_dataset\n\ntrain = load_dataset(\n    \"lightonai/ms-marco-en-bge\"… See the full description on the dataset page: 
https://huggingface.co/datasets/lightonai/ms-marco-en-bge.","downloads":247,"tags":["task_categories:feature-extraction","task_categories:sentence-similarity","multilinguality:monolingual","language:en","size_categories:10M<n<100M","format:parquet","modality:text","modality:timeseries","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","sentence-transformers","ColBERT","lightonai","PyLate"],"createdAt":"2024-08-26T08:31:58.000Z","key":""},{"_id":"66cdef1db1ed0830df16e44b","id":"Post-training-Data-Flywheel/AutoIF-instruct-61k","author":"Post-training-Data-Flywheel","disabled":false,"gated":false,"lastModified":"2024-08-27T19:04:19.000Z","likes":16,"trendingScore":1,"private":false,"sha":"0705aa867cce36d595123cb34f4a35281e6acc7c","downloads":108,"tags":["license:apache-2.0","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-08-27T15:22:05.000Z","key":""},{"_id":"66ce06adfcb7b0d255206a40","id":"lamm-mit/spider-silk-benchmark","author":"lamm-mit","disabled":false,"gated":false,"lastModified":"2024-08-27T17:02:38.000Z","likes":1,"trendingScore":1,"private":false,"sha":"e20ae8e0ae1685738d7f6dd567ad613d6a09df2d","downloads":41,"tags":["size_categories:n<1K","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-08-27T17:02:37.000Z","key":""},{"_id":"66cee624e06625ccda155eeb","id":"deepghs/fgo_voices_jp","author":"deepghs","disabled":false,"gated":false,"lastModified":"2024-08-28T09:14:22.000Z","likes":16,"trendingScore":1,"private":false,"sha":"8fe1cbd07be28420b3cbf957dcfce74a4ff343f3","description":"\n\t\n\t\t\n\t\tJP Voice-Text Dataset for FGO Waifus\n\t\n\nThis is the JP voice-text dataset for FGO playable characters. 
Very useful for fine-tuning or evaluating ASR/ASV models.\nOnly the voices with strictly one voice actor is maintained here to reduce the noise of this dataset.\n30800 records, 66.4 hours in total. Average duration is approximately 7.76s.\n\n\t\n\t\t\nid\nchar_id\nvoice_actor_name\nvoice_title\nvoice_text\ntime\nsample_rate\nfile_size\nfilename\nmimetype\nfile_url\n\n\n\t\t\nchar_1_SV1_0_对话12\n1\n高桥李依\n对话 12… See the full description on the dataset page: https://huggingface.co/datasets/deepghs/fgo_voices_jp.","downloads":69,"tags":["task_categories:automatic-speech-recognition","task_categories:audio-classification","language:ja","license:other","size_categories:10K<n<100K","format:parquet","modality:tabular","modality:text","modality:audio","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","audio","text","voice","anime","fgo"],"createdAt":"2024-08-28T08:56:04.000Z","key":""},{"_id":"66d1bb0a64c1e9b7328e9785","id":"japanese-asr/en_asr.mls","author":"japanese-asr","disabled":false,"gated":false,"lastModified":"2024-09-04T12:04:06.000Z","likes":3,"trendingScore":1,"private":false,"sha":"4febbbaf87288f6149a46757dc163062b740db3a","downloads":3172,"tags":["size_categories:10M<n<100M","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-08-30T12:28:58.000Z","key":""},{"_id":"66d1f9f2ad293ffc4b1cdfa7","id":"japanese-asr/ja_asr.reazon_speech_all","author":"japanese-asr","disabled":false,"gated":false,"lastModified":"2024-09-01T03:30:53.000Z","likes":7,"trendingScore":1,"private":false,"sha":"10c81088a41b64a99a94f5847a437e248b6a963b","downloads":1685,"tags":["size_categories:10M<n<100M","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-08-30T16:57:22.000Z","key":""},{"_id":"66d228596accd34f751c4a7b","id":"Luna288/image-captioning-FACAD-b
ase","author":"Luna288","disabled":false,"gated":false,"lastModified":"2024-09-04T22:36:50.000Z","likes":1,"trendingScore":1,"private":false,"sha":"b9fb184c3828ef0013d38fe6264a1012eb4e2936","downloads":246,"tags":["task_categories:image-to-text","language:en","license:bsd-3-clause","size_categories:100K<n<1M","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-08-30T20:15:21.000Z","key":""},{"_id":"66d3ea8ed51528a038454850","id":"gabrielchua/off-topic","author":"gabrielchua","disabled":false,"gated":false,"lastModified":"2024-11-23T01:41:36.000Z","likes":12,"trendingScore":1,"private":false,"sha":"226c66f56fbbd032b17ea07528b7057078eb0dc6","description":"\n\t\n\t\t\n\t\tOff-Topic Guardrails Dataset\n\t\n\n\n\t\n\t\t\n\t\tOverview\n\t\n\nThis dataset consists of synthetic LLM system prompts paired with user prompts, classified as either off-topic or on-topic. The aim is to provide realistic, real-world-inspired examples reflecting how large language models (LLMs) are used today for both open-ended and closed-ended tasks, such as text generation and classification. 
This dataset can be used for training and benchmarking off-topic guardrails.\n\n\t\n\t\t\n\t\n\t\n\t\tSynthetic Data… See the full description on the dataset page: https://huggingface.co/datasets/gabrielchua/off-topic.","downloads":509,"tags":["language:en","license:mit","size_categories:1M<n<10M","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2411.12946","region:us"],"createdAt":"2024-09-01T04:16:14.000Z","key":""},{"_id":"66d4e93b35eff7194d046f4c","id":"vajdaad4m/minecraft-skins-1.5k","author":"vajdaad4m","disabled":false,"gated":false,"lastModified":"2024-09-02T06:24:32.000Z","likes":1,"trendingScore":1,"private":false,"sha":"2f4640f0569d87080e3d7300f4b42462e54a5490","downloads":38,"tags":["size_categories:1K<n<10K","format:parquet","modality:image","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-09-01T22:22:51.000Z","key":""},{"_id":"66d5dd31bc1b7d385ed032c3","id":"ScalingIntelligence/monkey_business","author":"ScalingIntelligence","disabled":false,"gated":false,"lastModified":"2025-10-08T00:22:01.000Z","likes":20,"trendingScore":1,"private":false,"sha":"a9f8f73bcd6948a57ed922cba4e48062ef95f553","description":"\n\t\n\t\t\n\t\tMonkey Business\n\t\n\nMonkey Business is a dataset of samples from large language models. It contains both correct and incorrect samples from a variety of models (the Llama-3, Gemma, and Pythia series) on a variety of tasks (problems from GSM8K, MATH, CodeContests, and MiniF2F-MATH). 
We hope that it can be useful for developing improved verification methods that assess whether a model generated answer is correct.\nThis dataset was created as part of the project: \"Large Language Monkeys:… See the full description on the dataset page: https://huggingface.co/datasets/ScalingIntelligence/monkey_business.","downloads":545,"tags":["multilinguality:monolingual","language:en","license:mit","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2407.21787","arxiv:2206.14858","arxiv:2109.00110","region:us","math-word-problems","verifiers"],"createdAt":"2024-09-02T15:43:45.000Z","key":""},{"_id":"66d96022a243cba4b64da8bb","id":"LangAGI-Lab/world_model_for_wa_desc_with_tao_formatted_w_cot","author":"LangAGI-Lab","disabled":false,"gated":false,"lastModified":"2024-09-05T07:39:51.000Z","likes":1,"trendingScore":1,"private":false,"sha":"ce530363ef465214468959d056e002ee1e2b1419","description":"\n\t\n\t\t\n\t\tDataset Card for \"world_model_for_wa_desc_with_tao_formatted_w_cot\"\n\t\n\nMore Information 
needed\n","downloads":13,"tags":["size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-09-05T07:39:14.000Z","key":""},{"_id":"66dc4e6348cb97a720daea22","id":"japanese-asr/whisper_transcriptions.reazon_speech_all","author":"japanese-asr","disabled":false,"gated":false,"lastModified":"2024-09-14T08:02:36.000Z","likes":15,"trendingScore":1,"private":false,"sha":"96995b6abe6f447be95f4d6b7daa36476b809b46","downloads":116783,"tags":["size_categories:10M<n<100M","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-09-07T13:00:19.000Z","key":""},{"_id":"66dcf8ad3eeb464a8808c7a9","id":"nyuuzyou/classnotes","author":"nyuuzyou","disabled":false,"gated":false,"lastModified":"2024-10-11T15:08:17.000Z","likes":4,"trendingScore":1,"private":false,"sha":"2607b411e57b372ce62001d2e7dc85df4a0fc69e","description":"\n\t\n\t\t\n\t\tDataset Card for конспекты-уроков.рф\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThis dataset contains metadata about 65,068 lesson plans from the конспекты-уроков.рф (meaning in English would be something like class-notes[.]ru) platform, with 58,433 files available in their original format. The dataset includes information such as lesson plan titles, descriptions, authors, publication dates, and file sizes. 
The lesson plans are primarily in Russian and cover various educational subjects and grade… See the full description on the dataset page: https://huggingface.co/datasets/nyuuzyou/classnotes.","downloads":41,"tags":["task_categories:text-classification","task_categories:text-retrieval","annotations_creators:found","multilinguality:multilingual","source_datasets:original","language:ru","language:kk","language:uk","language:be","language:en","language:multilingual","license:cc-by-nc-3.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-09-08T01:06:53.000Z","key":""},{"_id":"66e0790c3905daceebb8ba82","id":"milistu/GUM-NER-conll","author":"milistu","disabled":false,"gated":false,"lastModified":"2024-09-12T15:31:47.000Z","likes":1,"trendingScore":1,"private":false,"sha":"4c370b09378284c370fa4907db40e58fca38e2f5","description":"\n\t\n\t\t\n\t\tGUM: The Georgetown University Multilayer Corpus\n\t\n\nThe GUM corpus was collected and annotated at Georgetown University. 
For more information, see the LICENSE.\n\n\t\n\t\t\n\t\tStructure\n\t\n\n\nNumber of labels: 23\n\n['O',\n 'B-abstract', 'I-abstract',\n 'B-animal', 'I-animal',\n 'B-event', 'I-event',\n 'B-object', 'I-object',\n 'B-organization', 'I-organization',\n 'B-person', 'I-person',\n 'B-place', 'I-place',\n 'B-plant', 'I-plant',\n 'B-quantity', 'I-quantity',\n 'B-substance', 'I-substance',\n 'B-time'… See the full description on the dataset page: https://huggingface.co/datasets/milistu/GUM-NER-conll.","downloads":136,"tags":["task_categories:token-classification","language:en","size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-09-10T16:51:24.000Z","key":""},{"_id":"66e156679460cd79bfe3b2d6","id":"longisland3/ptb-xl","author":"longisland3","disabled":false,"gated":false,"lastModified":"2024-09-17T08:45:41.000Z","likes":1,"trendingScore":1,"private":false,"sha":"34a5563a01793b150ac61fe0ec919a09fc0d044a","downloads":12603,"tags":["region:us"],"createdAt":"2024-09-11T08:35:51.000Z","key":""},{"_id":"66e1a2fb91e57a0788b501cb","id":"jackyhate/text-to-image-2M","author":"jackyhate","disabled":false,"gated":false,"lastModified":"2026-04-30T00:35:48.000Z","likes":164,"trendingScore":1,"private":false,"sha":"ef8a355d984e75ab579ff0f3e7dd0ebda9fe4ead","description":"\n\t\n\t\t\n\t\ttext-to-image-2M: A High-Quality, Diverse Text-to-Image Training Dataset\n\t\n\n\n\t\n\t\t\n\t\tOverview\n\t\n\ntext-to-image-2M is a curated text-image pair dataset designed for fine-tuning text-to-image models. The dataset consists of approximately 2 million samples, carefully selected and enhanced to meet the high demands of text-to-image model training. 
The motivation behind creating this dataset stems from the observation that datasets with over 1 million samples tend to produce better… See the full description on the dataset page: https://huggingface.co/datasets/jackyhate/text-to-image-2M.","downloads":9635,"tags":["task_categories:text-to-image","task_categories:image-to-text","task_categories:image-classification","language:en","license:mit","size_categories:100K<n<1M","format:webdataset","modality:image","modality:text","library:datasets","library:webdataset","library:mlcroissant","doi:10.57967/hf/3066","region:us"],"createdAt":"2024-09-11T14:02:35.000Z","key":""},{"_id":"66e34fac10b94d9826c33f2d","id":"milistu/Wikigold-NER-conll","author":"milistu","disabled":false,"gated":false,"lastModified":"2024-09-16T09:59:09.000Z","likes":1,"trendingScore":1,"private":false,"sha":"9c93054c1c337fb6753b9487640d72a11eff10f2","description":"\n\t\n\t\t\n\t\tNamed Entity Recognition in Wikipedia\n\t\n\nWikiGold is a manually annotated collection of Wikipedia text.\n\n\t\n\t\t\n\t\tStructure\n\t\n\n\nNumber of labels: 9\n\n['O', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER']\n\n\n\t\n\t\t\n\t\n\t\n\t\tTrain set\n\t\n\nNumber of sentences in the train set: 1472\nLabel count in train set:\n\n\t\n\t\t\nLabel\nCount\nPercentage (%)\n\n\n\t\t\nO\n26086\n83.649190\n\n\nI-ORG\n1470\n4.713805\n\n\nI-PER\n1303\n4.178291\n\nI-LOC\n1175\n3.767837\n\n\nI-MISC\n1151\n3.690877\n\n\n\t\n\n\n\t\n\t\t\n\t\tTest set… See the full description on the dataset page: 
https://huggingface.co/datasets/milistu/Wikigold-NER-conll.","downloads":66,"tags":["task_categories:token-classification","language:en","size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-09-12T20:31:40.000Z","key":""},{"_id":"66e3a995d3fa06c302728556","id":"cfli/bge-full-data","author":"cfli","disabled":false,"gated":false,"lastModified":"2024-10-11T04:20:33.000Z","likes":43,"trendingScore":1,"private":false,"sha":"78f5c99b534a52824ab26bd24edda592eaed4c7a","downloads":850,"tags":["region:us"],"createdAt":"2024-09-13T02:55:17.000Z","key":""},{"_id":"66e46a3f6e6ce3af7295dde6","id":"openai/MMMLU","author":"openai","disabled":false,"gated":false,"lastModified":"2024-10-16T18:39:00.000Z","likes":523,"trendingScore":1,"private":false,"sha":"325a01dc3e173cac1578df94120499aaca2e2504","description":"\n\t\n\t\t\n\t\tMultilingual Massive Multitask Language Understanding (MMMLU)\n\t\n\nThe MMLU is a widely recognized benchmark of general knowledge attained by AI models. It covers a broad range of topics from 57 different categories, covering elementary-level knowledge up to advanced professional subjects like law, physics, history, and computer science.\nWe translated the MMLU’s test set into 14 languages using professional human translators. 
Relying on human translators for this evaluation increases… See the full description on the dataset page: https://huggingface.co/datasets/openai/MMMLU.","downloads":9749,"tags":["task_categories:question-answering","language:ar","language:bn","language:de","language:es","language:fr","language:hi","language:id","language:it","language:ja","language:ko","language:pt","language:sw","language:yo","language:zh","license:mit","size_categories:100K<n<1M","format:csv","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2009.03300","region:us"],"createdAt":"2024-09-13T16:37:19.000Z","key":""},{"_id":"66e6273ae9fb6492ba091602","id":"BAAI/IndustryCorpus2_aerospace","author":"BAAI","disabled":false,"gated":false,"lastModified":"2024-09-18T04:07:50.000Z","likes":5,"trendingScore":1,"private":false,"sha":"b60c1d0690aa96b3761adca14aad3a997ed98b9c","downloads":40,"tags":["size_categories:1M<n<10M","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-09-15T00:15:54.000Z","key":""},{"_id":"66ebb7af703a567feca77e83","id":"BAAI/CCI3-HQ","author":"BAAI","disabled":false,"gated":"auto","lastModified":"2024-11-11T12:27:29.000Z","likes":60,"trendingScore":1,"private":false,"sha":"d6f3aa30cebfef497e822ff968ed68a18bf90b8f","description":"\n\t\n\t\t\n\t\tData Description\n\t\n\nTo address the scarcity of high-quality safety datasets in the Chinese, we open-sourced the CCI (Chinese Corpora Internet) dataset on November 29, 2023. \nBuilding on this foundation, we continue to expand the data source, adopt stricter data cleaning methods, and complete the construction of the CCI 3.0 dataset. This dataset is composed of high-quality, reliable Internet data from trusted sources. 
\nAnd then with more stricter filtering, The CCI 3.0 HQ corpus… See the full description on the dataset page: https://huggingface.co/datasets/BAAI/CCI3-HQ.","downloads":3882,"tags":["task_categories:text-generation","language:zh","size_categories:10M<n<100M","format:json","modality:text","library:datasets","library:dask","library:mlcroissant","arxiv:2410.18505","region:us"],"createdAt":"2024-09-19T05:33:35.000Z","key":""},{"_id":"66ed25122754b5db61760df6","id":"OpenGVLab/InternVL-SA-1B-Caption","author":"OpenGVLab","disabled":false,"gated":false,"lastModified":"2024-09-21T04:41:15.000Z","likes":23,"trendingScore":1,"private":false,"sha":"eaf6bc78cb23d4c654f0db7b8684cf5e1a9957a0","description":"\n\t\n\t\t\n\t\tDataset Card for InternVL-SA-1B-Caption\n\t\n\n\n\t\n\t\t\n\t\tOverview\n\t\n\nThe InternVL-SA-1B-Caption Dataset is a bilingual dataset created using the InternVL2-Llama3-76B model. The dataset contains 12 million image-caption pairs in both English and Chinese. All images are sourced from Meta’s SA-1B dataset, and captions were generated using specific prompts designed to minimize hallucinations and ensure accurate descriptions based on visible image content. 
The dataset is intended for use in tasks… See the full description on the dataset page: https://huggingface.co/datasets/OpenGVLab/InternVL-SA-1B-Caption.","downloads":250,"tags":["license:mit","size_categories:1M<n<10M","format:json","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2312.14238","arxiv:2404.16821","region:us"],"createdAt":"2024-09-20T07:32:34.000Z","key":""},{"_id":"66ee820716362f8e97f5e816","id":"MehdiHosseiniMoghadam/ConvFinQA","author":"MehdiHosseiniMoghadam","disabled":false,"gated":false,"lastModified":"2024-09-21T08:31:46.000Z","likes":2,"trendingScore":1,"private":false,"sha":"04f871d29cc8735ebf2eeb9c0c81f2c8d9148d01","description":"\n\t\n\t\t\n\t\ttitle: Financial Document QA Dataset\ntags:\n  - financial-documents\n  - question-answering\n  - tabular-data\n  - machine-learning\nlicense: apache-2.0\ndatasets:\n  - financial-document-qa\nlanguage: en\n\t\n\n\n\t\n\t\t\n\t\tConvFinQA: Financial Document QA Dataset\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThe ConvFinQA dataset and code from EMNLP 2022 paper(https://arxiv.org/abs/2210.03849)\n\n\t\n\t\t\n\t\tDataset Features\n\t\n\n\nPre-text: Contextual paragraphs that precede a table, giving information relevant to the… See the full description on the dataset page: https://huggingface.co/datasets/MehdiHosseiniMoghadam/ConvFinQA.","downloads":402,"tags":["size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2210.03849","region:us"],"createdAt":"2024-09-21T08:21:27.000Z","key":""},{"_id":"66f14a7926413db5f6c63b5b","id":"FBK-MT/mosel","author":"FBK-MT","disabled":false,"gated":false,"lastModified":"2025-10-07T08:07:55.000Z","likes":90,"trendingScore":1,"private":false,"sha":"40d224f5d6d61246ab80c369c21405c24952bbf1","description":"\n\n\n\t\n\t\t\n\t\tDataset Description, Collection, and Source\n\t\n\nThe MOSEL corpus is a multilingual 
dataset collection including up to 950K hours of open-source speech recordings covering the 24 official languages of the European Union. We collect data by surveying labeled and unlabeled speech corpora under open-source compliant licenses.\nIn particular, MOSEL includes the automatic transcripts of 441k hours of unlabeled speech from VoxPopuli and LibriLight. The data is transcribed using Whisper large… See the full description on the dataset page: https://huggingface.co/datasets/FBK-MT/mosel.","downloads":1360,"tags":["task_categories:automatic-speech-recognition","task_categories:text-to-speech","annotations_creators:machine-generated","language_creators:found","multilinguality:multilingual","language:en","language:bg","language:hr","language:cs","language:da","language:nl","language:et","language:fi","language:fr","language:de","language:el","language:hu","language:ga","language:it","language:lv","language:lt","language:mt","language:pl","language:pt","language:ro","language:sk","language:sl","language:es","language:sv","license:cc-by-4.0","size_categories:1M<n<10M","modality:image","modality:tabular","modality:text","modality:audio","arxiv:2410.01036","region:us","speech","speech-to-text","open-source","whisper","text","audio"],"createdAt":"2024-09-23T11:01:13.000Z","key":""},{"_id":"66f52e56b3ab81ac10f78c4c","id":"VTSNLP/vietnamese_curated_dataset","author":"VTSNLP","disabled":false,"gated":false,"lastModified":"2024-11-24T02:23:21.000Z","likes":75,"trendingScore":1,"private":false,"sha":"b81fcce58945970117a1b56d50ec81be2628a5c3","description":"\n\t\n\t\t\n\t\tDataset Description\n\t\n\nVietnamese Curated Text Dataset. This dataset is collected from multiple open Vietnamese datasets, and curated with NeMo Curator\n\nDeveloped by: Viettel Solutions\nLanguage: Vietnamese\n\n\n\t\n\t\t\n\t\tDetails\n\t\n\nPlease visit our Tech Blog post on NVIDIA's plog page for details. 
Link\n\n\t\n\t\t\n\t\tData Collection\n\t\n\nWe utilize a combination of datasets that contain samples in Vietnamese language, ensuring a robust and representative text corpus. These datasets include: \n\nThe… See the full description on the dataset page: https://huggingface.co/datasets/VTSNLP/vietnamese_curated_dataset.","downloads":924,"tags":["size_categories:10M<n<100M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-09-26T09:50:14.000Z","key":""},{"_id":"66f65526a395d5e5ede5a36c","id":"weizhiwang/Open-Qwen2VL-Data","author":"weizhiwang","disabled":false,"gated":false,"lastModified":"2025-04-16T00:39:28.000Z","likes":25,"trendingScore":1,"private":false,"sha":"1b6bbe8e86db2400e6e27a631ce5e47c5a7bcc0c","description":"\n\t\n\t\t\n\t\tIntroduction\n\t\n\nThis repository contains the data for Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources.\nProject page: https://victorwz.github.io/Open-Qwen2VL\nCode: https://github.com/Victorwz/Open-Qwen2VL\n\n\t\n\t\t\n\t\n\t\n\t\tDataset\n\t\n\n\nccs_ebdataset: CC3M-CC12M-SBU filtered by CLIP, we directly download the webdataset based on the released of curated subset of BLIP-1\ndatacomp_medium_dfn_webdataset: DataComp-Medium-128M filtered by DFN, we just… See the full description on the dataset page: 
https://huggingface.co/datasets/weizhiwang/Open-Qwen2VL-Data.","downloads":1384,"tags":["task_categories:image-text-to-text","license:mit","size_categories:10M<n<100M","format:parquet","modality:image","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2504.00595","region:us"],"createdAt":"2024-09-27T06:48:06.000Z","key":""},{"_id":"66f7c17c58bd6bdc3510a9bd","id":"chemouda/antler_portfolio","author":"chemouda","disabled":false,"gated":false,"lastModified":"2024-09-28T08:43:00.000Z","likes":2,"trendingScore":1,"private":false,"sha":"69d0f2e41ad6fcdb9c23d1a260715ba21aa043ff","downloads":46,"tags":["size_categories:n<1K","format:csv","modality:image","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-09-28T08:42:36.000Z","key":""},{"_id":"66f82294d88dc2ad510fb72c","id":"milistu/Pile-NER-type-conll","author":"milistu","disabled":false,"gated":false,"lastModified":"2025-03-21T16:25:58.000Z","likes":3,"trendingScore":1,"private":false,"sha":"bb9023ed25d63d2e52bd74200f6a5e3e63406f11","description":"\n\t\n\t\t\n\t\tPile-NER Dataset in CoNLL Format\n\t\n\n\n\t\n\t\t\n\t\tOverview\n\t\n\nThe Pile-NER-type dataset provides named entity recognition annotations derived from The Pile, a large-scale text corpus. 
This dataset is formatted in CoNLL style for easy use with token classification models.\n\n\t\n\t\t\n\t\tStructure\n\t\n\n\nFormat: CoNLL\nSplit: Train only (45,889 examples)\nFeatures:\nid: Unique identifier for each example\nwords: Sequence of tokens\nner_tags: Named entity tags for each token\nlabels: Label annotations for each… See the full description on the dataset page: https://huggingface.co/datasets/milistu/Pile-NER-type-conll.","downloads":94,"tags":["size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-09-28T15:36:52.000Z","key":""},{"_id":"66f838d23fab21c9a08e1d12","id":"MahiA/UrbanSound8K","author":"MahiA","disabled":false,"gated":false,"lastModified":"2024-11-21T15:08:38.000Z","likes":2,"trendingScore":1,"private":false,"sha":"1b8115722ecb83284854a1452b2d702c27ad5bcc","description":"\n\t\n\t\t\n\t\tUrbanSound8K\n\t\n\nThis is an audio classification dataset for Sound Event Classification.\nClasses = 10    ,    Split = Ten-Fold \n\n\t\n\t\t\n\t\tStructure\n\t\n\n\naudios folder contains audio files.\ncsv_files folder contains CSV files for ten-fold cross-validation.\nTo perform cross-validation on fold 1, train_1.csv will be used for the training split and test_1.csv for the testing split, with the same pattern followed for the other folds.\nTo perform training and testing without cross-validation, use… See the full description on the dataset page: 
https://huggingface.co/datasets/MahiA/UrbanSound8K.","downloads":2978,"tags":["license:mit","size_categories:10K<n<100K","format:csv","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-09-28T17:11:46.000Z","key":""},{"_id":"66f83cd9b720048bbc4a3db7","id":"milistu/NuNER-conll","author":"milistu","disabled":false,"gated":false,"lastModified":"2025-03-21T16:30:57.000Z","likes":3,"trendingScore":1,"private":false,"sha":"379944bad9293cb735189241c15f72605d2a5367","description":"\n\t\n\t\t\n\t\tNuNER Dataset in CoNLL Format\n\t\n\n\n\t\n\t\t\n\t\tOverview\n\t\n\nThe NuNER dataset is a large-scale named entity recognition corpus in CoNLL format, containing nearly 1 million annotated examples. It's designed for training robust NER models across various domains and entity types.\n\n\t\n\t\t\n\t\tStructure\n\t\n\n\nFormat: CoNLL\nSplit: Train only (971,842 examples)\nFeatures:\nid: Numeric identifier for each example\nwords: Sequence of tokens\nner_tags: Named entity tags for each token\nlabels: Label annotations… See the full description on the dataset page: https://huggingface.co/datasets/milistu/NuNER-conll.","downloads":41,"tags":["size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2402.15343","region:us"],"createdAt":"2024-09-28T17:28:57.000Z","key":""},{"_id":"66fdfc4c7d722f08797ef6b1","id":"haoranxu/X-ALMA-Parallel-Data","author":"haoranxu","disabled":false,"gated":false,"lastModified":"2024-10-07T06:11:17.000Z","likes":8,"trendingScore":1,"private":false,"sha":"87ec2cb8891b9a1fb53b844caef42bf7e1e487dc","description":"\nThis is the translation parallel dataset used by X-ALMA.\n@misc{xu2024xalmaplugplay,\n      title={X-ALMA: Plug & Play Modules and Adaptive Rejection for Quality Translation at Scale}, \n      author={Haoran Xu and Kenton Murray and Philipp Koehn and Hieu Hoang and Akiko Eriguchi 
and Huda Khayrallah},\n      year={2024},\n      eprint={2410.03115},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https://arxiv.org/abs/2410.03115}, \n}\n\n","downloads":92,"tags":["language:en","language:da","language:nl","language:de","language:is","language:no","language:sc","language:af","language:ca","language:ro","language:gl","language:it","language:pt","language:es","language:bg","language:mk","language:sr","language:uk","language:ru","language:id","language:ms","language:th","language:vi","language:mg","language:fr","language:hu","language:el","language:cs","language:pl","language:lt","language:lv","language:ka","language:zh","language:ja","language:ko","language:fi","language:et","language:gu","language:hi","language:mr","language:ne","language:ur","language:az","language:kk","language:ky","language:tr","language:uz","language:ar","language:he","language:fa","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2410.03115","region:us"],"createdAt":"2024-10-03T02:07:08.000Z","key":""},{"_id":"66fea77632224f5ddc4020c3","id":"princeton-nlp/prolong-data-512K","author":"princeton-nlp","disabled":false,"gated":false,"lastModified":"2025-03-05T20:30:35.000Z","likes":13,"trendingScore":1,"private":false,"sha":"0bb2becdbb958cf29804da01ae615eb838170aa4","description":"\n\t\n\t\t\n\t\tprinceton-nlp/prolong-data-512K\n\t\n\n[Paper] [HF Collection] [Code]\nProLong (Princeton long-context language models) is a family of long-context models that are continued trained and supervised fine-tuned from Llama-3-8B, with a maximum context window of 512K tokens. 
Our main ProLong model is one of the best-performing long-context models at the 10B scale (evaluated by HELMET).\nTo train this strong long-context model, we conduct thorough ablations on the long-context pre-training data… See the full description on the dataset page: https://huggingface.co/datasets/princeton-nlp/prolong-data-512K.","downloads":9437,"tags":["language:en","arxiv:2410.02660","region:us","long-context"],"createdAt":"2024-10-03T14:17:26.000Z","key":""},{"_id":"66ffff94008ef654647cefd0","id":"jingyaogong/minimind-v_dataset","author":"jingyaogong","disabled":false,"gated":false,"lastModified":"2026-04-19T10:57:29.000Z","likes":36,"trendingScore":1,"private":false,"sha":"1e279a8b665cb10383451a6af6fd62b9f35bdd79","description":"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tⅠ 数据集\n\t\n\n本轮训练用到的图文数据全部来自 ALLaVA-4V 系列。\n相比以往从几份 LLaVA 衍生集拼接得到的数据，ALLaVA-4V 的质量更整齐、中英双语原生对照，细粒度描述也更充分。\n它由两个子源构成：一份是 LAION 里挑出来的高质量图片（自然图像为主），一份是 VFLAN 指令流里挑出来的图片（文档、图表、合成场景居多）。\n\nPretrain（pretrain_i2t.parquet，约 127 万条 / ~64 万张唯一图像）\n\nALLaVA-Caption-LAION-4V 英/中：~47万 + ~44万ALLaVA-Caption-VFLAN-4V 英/中：~19万 + ~17万\n任务形式为\"请描述这张图片\"类的单轮长描述，用于让模型建立视觉 token 到语言 token 的基础对齐。\n\n\nSFT（sft_i2t.parquet，约 290 万条 / ~65 万张唯一图像）\n\nALLaVA-Instruct-LAION-4V 英/中：~47万 + ~47万… See the full description on the dataset page: https://huggingface.co/datasets/jingyaogong/minimind-v_dataset.","downloads":1436,"tags":["task_categories:visual-question-answering","language:zh","language:en","license:apache-2.0","size_categories:n<1K","format:imagefolder","modality:image","library:datasets","library:mlcroissant","region:us","multimodal"],"createdAt":"2024-10-04T14:45:40.000Z","key":""},{"_id":"6701c51b1b322aa32ea674b0","id":"playlogue/playlogue-v1","author":"playlogue","disabled":false,"gated":"auto","lastModified":"2024-10-11T02:56:42.000Z","likes":9,"trendingScore":1,"private":false,"sha":"fe2d2d1298075f4d15c964e5bb45534eecf6e683","description":"\n\t\n\t\t\n\t\tPlaylogue: Dataset and 
Benchmarks for Analyzing Adult-Child Conversations During Play\n\t\n\nPlaylogue is a first-of-its-kind dataset of naturalistic adult-child conversations with transcripts, speaker information, and speech acts. It is designed to develop and evaluate audio and language models on child-centered speech involving preschool-aged children. For more details, please refer to our paper.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Details\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Description\n\t\n\nPlaylogue is a curated… See the full description on the dataset page: https://huggingface.co/datasets/playlogue/playlogue-v1.","downloads":57,"tags":["task_categories:automatic-speech-recognition","task_categories:audio-classification","task_categories:text-classification","annotations_creators:expert-generated","source_datasets:CHILDES","language:en","license:cc-by-nc-sa-3.0","modality:audio","region:us","audio","child speech","speaker diarization","asr","speech acts","dpics"],"createdAt":"2024-10-05T23:00:43.000Z","key":""},{"_id":"6702f25eb4c68c06b6f235c8","id":"Dampfinchen/Creative_Writing_Multiturn","author":"Dampfinchen","disabled":false,"gated":false,"lastModified":"2026-01-23T11:20:56.000Z","likes":33,"trendingScore":1,"private":false,"sha":"642448c0399c9fe10eaa2d5e51b50cd1c3b78a2e","description":"UPDATE 2026: Stronger filtering using a very sophisticated filtering script and new data including a very small subset of https://huggingface.co/datasets/lemon07r/VellumK2T-Fiction-SFT-01 reasoning for thinking with a custom system prompt attached. This is suitable for both instruct non-thinking and thinking models, as I have added a system prompt for these few samples that use the tags <!think!> and </!think!> (without exclamation marks of course). 
\nThis is a dataset merge of many, many high… See the full description on the dataset page: https://huggingface.co/datasets/Dampfinchen/Creative_Writing_Multiturn.","downloads":613,"tags":["language:en","license:apache-2.0","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","text","multiturn","creative writing","story","roleplaying","json"],"createdAt":"2024-10-06T20:26:06.000Z","key":""},{"_id":"6703a9b1dfea46624547b361","id":"Sterzhang/PVIT-3M","author":"Sterzhang","disabled":false,"gated":false,"lastModified":"2024-11-02T07:41:57.000Z","likes":20,"trendingScore":1,"private":false,"sha":"68c0ad34851b06e7e408b092c1f8ee1004f6c92b","description":"\n\t\n\t\t\n\t\tPVIT-3M\n\t\n\nThe paper titled \"Personalized Visual Instruction Tuning\" introduces a novel dataset called PVIT-3M. This dataset is specifically designed for tuning MLLMs in the context of personalized visual instruction tasks. The dataset consists of 3 million image-text pairs that aim to improve MLLMs' abilities to generate responses based on personalized visual inputs, making them more tailored and adaptable to individual user needs and preferences.\nHere’s the PVIT-3M statistics:… See the full description on the dataset page: 
https://huggingface.co/datasets/Sterzhang/PVIT-3M.","downloads":1079,"tags":["task_categories:visual-question-answering","task_categories:image-text-to-text","language:en","license:apache-2.0","size_categories:1M<n<10M","format:json","modality:image","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2410.07113","region:us","multi-modal","personalized"],"createdAt":"2024-10-07T09:28:17.000Z","key":""},{"_id":"67058fd53c11b9d560eea299","id":"drveronika/x_fake_profile_detection","author":"drveronika","disabled":false,"gated":false,"lastModified":"2024-10-08T20:34:23.000Z","likes":5,"trendingScore":1,"private":false,"sha":"28bc1fbee3aa928ab3d29d9766155a5233465216","description":"\n\t\n\t\t\n\t\tDataset: Detecting Fake Accounts on Social Media Portals—The X Portal Case Study\n\t\n\nThis dataset was created as part of the study focused on detecting fake accounts on the X Portal (formerly known as Twitter). The primary aim of the study was to classify social media accounts using image data and machine learning techniques, offering a novel approach to identifying fake accounts. 
The dataset includes generated accounts, which were used to train and test a Convolutional Neural Network… See the full description on the dataset page: https://huggingface.co/datasets/drveronika/x_fake_profile_detection.","downloads":44,"tags":["task_categories:image-classification","language:en","license:mit","size_categories:10K<n<100K","format:imagefolder","modality:image","library:datasets","library:mlcroissant","region:us","x","twitter","profile","network","fake","social","media"],"createdAt":"2024-10-08T20:02:29.000Z","key":""},{"_id":"6705961bf8a5cbf7e6963022","id":"TrustAIRLab/in-the-wild-jailbreak-prompts","author":"TrustAIRLab","disabled":false,"gated":false,"lastModified":"2024-11-19T13:45:28.000Z","likes":33,"trendingScore":1,"private":false,"sha":"a10aab8eff1c73165a442d4464dce192bd28b9c5","description":"\n\t\n\t\t\n\t\tIn-The-Wild Jailbreak Prompts on LLMs\n\t\n\nThis is the official repository for the ACM CCS 2024 paper \"Do Anything Now'': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models by Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang.\nIn this project, employing our new framework JailbreakHub, we conduct the first measurement study on jailbreak prompts in the wild, with 15,140 prompts collected from December 2022 to December 2023 (including 1,405… See the full description on the dataset page: 
https://huggingface.co/datasets/TrustAIRLab/in-the-wild-jailbreak-prompts.","downloads":3270,"tags":["task_categories:text-generation","license:mit","size_categories:10K<n<100K","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2308.03825","region:us"],"createdAt":"2024-10-08T20:29:15.000Z","key":""},{"_id":"670760996d79c9796255a647","id":"litagin/Galgame_Speech_ASR_16kHz","author":"litagin","disabled":false,"gated":false,"lastModified":"2024-10-14T06:37:25.000Z","likes":46,"trendingScore":1,"private":false,"sha":"3fb86654222b3f0af0f7c332ae6a0ef9752a9451","description":"\n\t\n\t\t\n\t\tDataset Card for Galgame_Speech_ASR_16kHz\n\t\n\n\n[!IMPORTANT]The following rules (in the original repository) must be followed:\n必须遵守GNU General Public License v3.0内的所有协议！附加：禁止商用，本数据集以及使用本数据集训练出来的任何模型都不得用于任何商业行为，如要用于商业用途，请找数据列表内的所有厂商授权（笑），因违反开源协议而出现的任何问题都与本人无关！\n训练出来的模型必须开源，是否在README内引用本数据集由训练者自主决定，不做强制要求。\nEnglish:\nYou must comply with all the terms of the GNU General Public License v3.0!Additional note: Commercial use is prohibited. 
This dataset and any model trained using this dataset… See the full description on the dataset page: https://huggingface.co/datasets/litagin/Galgame_Speech_ASR_16kHz.","downloads":559,"tags":["task_categories:automatic-speech-recognition","multilinguality:monolingual","language:ja","license:gpl-3.0","size_categories:1M<n<10M","format:webdataset","modality:audio","modality:text","library:datasets","library:webdataset","library:mlcroissant","region:us","speech","audio","text","japanese","anime","voice","visual novel","galgame"],"createdAt":"2024-10-10T05:05:29.000Z","key":""},{"_id":"6707c8a87a737319934442a6","id":"openlanguagedata/flores_plus","author":"openlanguagedata","disabled":false,"gated":"auto","lastModified":"2026-05-23T15:27:24.000Z","likes":150,"trendingScore":1,"private":false,"sha":"7555f1216a8f8c0b4227f0a2cd82de6db2e902d9","description":"\n\t\n\t\t\n\t\n\t\n\t\tDataset Card for FLORES+\n\t\n\nFLORES+ is an evaluation benchmark dataset for multilingual machine translation.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Details\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Description\n\t\n\nFLORES+ is a multilingual machine translation benchmark released under CC BY-SA 4.0. This dataset was originally released by FAIR researchers at Meta under the name FLORES. Further information about these initial releases can be found in Dataset Sources below. 
The data is now being managed by OLDI, the Open… See the full description on the dataset page: https://huggingface.co/datasets/openlanguagedata/flores_plus.","downloads":17031,"tags":["task_categories:text-generation","task_categories:translation","annotations_creators:found","language_creators:expert-generated","multilinguality:multilingual","multilinguality:translation","source_datasets:extended|flores","language:ace","language:acm","language:acq","language:aeb","language:af","language:ajp","language:ak","language:als","language:am","language:apc","language:ar","language:ars","language:ary","language:arz","language:as","language:ast","language:awa","language:ayr","language:azb","language:azj","language:ba","language:bm","language:ban","language:be","language:bem","language:bn","language:bho","language:bjn","language:bo","language:bs","language:bug","language:bg","language:ca","language:ceb","language:cs","language:cjk","language:ckb","language:crh","language:cy","language:da","language:de","language:dar","language:dik","language:dyu","language:dz","language:el","language:en","language:eo","language:et","language:eu","language:ee","language:fo","language:fj","language:fi","language:fon","language:fr","language:fur","language:fuv","language:gaz","language:gd","language:ga","language:gl","language:gn","language:gu","language:ht","language:ha","language:he","language:hi","language:hne","language:hr","language:hu","language:hy","language:ig","language:ilo","language:id","language:is","language:it","language:jv","language:ja","language:kab","language:kac","language:kam","language:kn","language:ks","language:ka","language:kk","language:kbp","language:kea","language:khk","language:km","language:ki","language:rw","language:kjh","language:ky","language:kmb","language:kmr","language:knc","language:kg","language:ko","language:lo","language:lij","language:li","language:lld","language:ln","language:lt","language:lmo","language:ltg","language:lb","language:lua","language:lg","la
nguage:luo","language:lus","language:lvs","language:mag","language:mai","language:ml","language:mar","language:mfe","language:mhr","language:min","language:mk","language:mt","language:mni","language:mos","language:mi","language:my","language:nl","language:nn","language:nb","language:npi","language:nso","language:nus","language:ny","language:oc","language:ory","language:pag","language:pa","language:pap","language:pbt","language:pes","language:plt","language:pl","language:pt","language:prs","language:quy","language:ro","language:rn","language:ru","language:sg","language:sa","language:sat","language:scn","language:shn","language:si","language:sk","language:sl","language:sm","language:sn","language:sd","language:so","language:st","language:es","language:sc","language:sr","language:ss","language:su","language:sv","language:swh","language:szl","language:ta","language:taq","language:tt","language:te","language:tg","language:tl","language:th","language:ti","language:tpi","language:tn","language:ts","language:tk","language:tum","language:tr","language:tw","language:tzm","language:udm","language:ug","language:uk","language:umb","language:ur","language:uzn","language:uzs","language:vec","language:vi","language:war","language:wo","language:xh","language:ydd","language:yo","language:yue","language:zgh","language:zh","language:zsm","language:zu","license:cc-by-sa-4.0","size_categories:100K<n<1M","format:json","modality:tabular","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2207.04672","region:us","text"],"createdAt":"2024-10-10T12:29:28.000Z","key":""},{"_id":"6708938cb6272e48f044a95e","id":"nlile/math_prompt_ti_84v2_results","author":"nlile","disabled":false,"gated":"manual","lastModified":"2024-10-11T02:55:59.000Z","likes":1,"trendingScore":1,"private":false,"sha":"76ac1422ea73a40af050cf588124562ef58e4689","downloads":2,"tags":["size_categories:100K<n<1M","format:csv","modality:tabular","modality:text","library:datasets","librar
y:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-10-11T02:55:08.000Z","key":""},{"_id":"670c9eb83097b4997463642e","id":"joyfuljune/S16k","author":"joyfuljune","disabled":false,"gated":false,"lastModified":"2026-01-29T12:57:32.000Z","likes":1,"trendingScore":1,"private":false,"sha":"89c8254f3b550c25dd107f00e3ed4dc2ec541cd9","description":"This is a multi-label music emotion recognition dataset that contains more than 16k songs. \nThe dataset was obtained from NetEase Cloud Music, with specific details referenced from the paper https://doi.org/10.1007/s00530-025-01701-z\nThe NPZ file contains the MFCC features extracted by librosa from the middle 30 seconds of all songs. The playlist IDs in the JSON file correspond to the song IDs in the CSV file.\nsongs_9822 is a balanced multi-label subset of the original S16k. It contains 9822… See the full description on the dataset page: https://huggingface.co/datasets/joyfuljune/S16k.","downloads":18,"tags":["license:cc-by-nc-sa-4.0","region:us"],"createdAt":"2024-10-14T04:31:52.000Z","key":""},{"_id":"670d4bb8207a1458e88ab1f6","id":"gretelai/gretel-pii-masking-en-v1","author":"gretelai","disabled":false,"gated":false,"lastModified":"2025-12-17T15:21:27.000Z","likes":44,"trendingScore":1,"private":false,"sha":"e06eb1499ca8d54470f085021cd8e54f9efac7fd","description":"\n\n\n\n\n\t\n\t\t\n\t\tGretel Synthetic Domain-Specific Documents Dataset (English)\n\t\n\nThis dataset is a synthetically generated collection of documents enriched with Personally Identifiable Information (PII) and Protected Health Information (PHI) entities spanning multiple domains. \nCreated using Gretel Navigator with mistral-nemo-2407 as the backend model, it is specifically designed for fine-tuning Gliner models. 
\nThe dataset contains document passages featuring PII/PHI entities from a wide range of… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1.","downloads":852,"tags":["task_categories:text-classification","task_categories:text-generation","language:en","license:apache-2.0","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","library:datadesigner","region:us","datadesigner","synthetic","domain-specific","text","NER"],"createdAt":"2024-10-14T16:50:00.000Z","key":""},{"_id":"670ebd9eba29b3fca231bbd2","id":"kellycyy/daily_dilemmas","author":"kellycyy","disabled":false,"gated":false,"lastModified":"2024-10-15T20:09:22.000Z","likes":10,"trendingScore":1,"private":false,"sha":"372352c4143ec4923c50b4472bd3148a474f7ab8","description":"\n\t\n\t\t\n\t\tDailyDilemmas - Revealing Value Preferences of LLMs with Quandaries of Daily Life\n\t\n\n\n\t\n\t\t\n\t\tLink: Paper\n\t\n\n\n\t\n\t\t\n\t\tDescription of DailyDilemma\n\t\n\n\nDailyDilemma is a dataset of 1,360 moral dilemmas encountered in everyday life. 
Each dilemma includes two possible actions and with each action, the affected parties and human values invoked.\nWe evaluated LLMs on these dilemmas to determine what action they will take and the values represented by these actions\n\n\n\t\n\t\t\n\t\tDataset details… See the full description on the dataset page: https://huggingface.co/datasets/kellycyy/daily_dilemmas.","downloads":178,"tags":["license:cc-by-4.0","size_categories:10K<n<100K","format:csv","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2410.02683","region:us"],"createdAt":"2024-10-15T19:08:14.000Z","key":""},{"_id":"670f1679f041daeab76fcde9","id":"Virtue-AI-HUB/SecCodePLT","author":"Virtue-AI-HUB","disabled":false,"gated":"auto","lastModified":"2024-10-16T06:03:26.000Z","likes":9,"trendingScore":1,"private":false,"sha":"62efb6f03741aa6ae10afa801a7905f734baa3c1","description":"\n\t\n\t\t\n\t\tSecCodePLT\n\t\n\n\nSecCodePLT is a unified and comprehensive evaluation platform for code GenAIs' risks.\n\n\t\n\t\t\n\t\t1. Dataset Details\n\t\n\n\n\t\n\t\t\n\t\t1.1 Dataset Description\n\t\n\n\n\n\n\n\n\nLanguage(s) (NLP): English\nLicense: MIT\n\n\n\t\n\t\t\n\t\t1.2 Dataset Sources\n\t\n\n\n\n\nRepository: Coming soon\nPaper: https://arxiv.org/pdf/2410.11096\nDemo: https://seccodeplt.github.io/\n\n\n\n\t\n\t\t\n\t\n\t\n\t\t2. 
Uses\n\t\n\n\n\n\n\t\n\t\t\n\t\n\t\n\t\t2.1 Direct Use\n\t\n\nThis dataset can be used to evaluate the risks of large language models generating… See the full description on the dataset page: https://huggingface.co/datasets/Virtue-AI-HUB/SecCodePLT.","downloads":340,"tags":["task_categories:question-answering","task_categories:text-generation","language:en","license:mit","size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2410.11096","region:us","code"],"createdAt":"2024-10-16T01:27:21.000Z","key":""},{"_id":"67101333716702a1e3126a55","id":"RayanAi/Main_teeth_dataset","author":"RayanAi","disabled":false,"gated":false,"lastModified":"2024-10-16T19:37:04.000Z","likes":1,"trendingScore":1,"private":false,"sha":"4b87da38b8a5996756b264c248f7d4b27012410d","downloads":27,"tags":["size_categories:1K<n<10K","format:imagefolder","modality:image","library:datasets","library:mlcroissant","region:us"],"createdAt":"2024-10-16T19:25:39.000Z","key":""},{"_id":"67109a6b07325c4b0ae03fc7","id":"shiertier/behance_json","author":"shiertier","disabled":false,"gated":false,"lastModified":"2024-10-17T11:56:20.000Z","likes":1,"trendingScore":1,"private":false,"sha":"bffa9120c5f941e186e749bfe6283a19378ac2dc","downloads":46,"tags":["license:unknown","size_categories:1M<n<10M","format:json","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","region:us"],"createdAt":"2024-10-17T05:02:35.000Z","key":""},{"_id":"67168507374ebff4a06b53b6","id":"PULSE-ECG/ECGInstruct","author":"PULSE-ECG","disabled":false,"gated":false,"lastModified":"2024-10-28T01:30:35.000Z","likes":19,"trendingScore":1,"private":false,"sha":"264a7a397d1ded4b4b0cfc52324ec8274a2b4800","description":"\n\t\n\t\t\n\t\tECGInstruct\n\t\n\nDataset for paper \"Teach Multimodal LLMs to Comprehend Electrocardiographic Images\".\n🌐 Project Page: https://aimedlab.github.io/PULSE/\n📄 Paper: 
https://arxiv.org/abs/2410.19008\n🧑‍💻 Code: https://github.com/AIMedLab/PULSE\n🤗 Model: https://huggingface.co/PULSE-ECG/PULSE-7B\n⚖️ ECGBench: https://huggingface.co/datasets/PULSE-ECG/ECGBench\n\n\t\n\t\t\n\t\n\t\n\t\tIntroduction\n\t\n\nECGInstruct is a comprehensive and large-scale instruction-tuning dataset designed for ECG image… See the full description on the dataset page: https://huggingface.co/datasets/PULSE-ECG/ECGInstruct.","downloads":743,"tags":["license:apache-2.0","size_categories:1M<n<10M","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2410.19008","region:us"],"createdAt":"2024-10-21T16:44:55.000Z","key":""},{"_id":"67174a171e9962496ab96788","id":"hlt-lab/voicebench","author":"hlt-lab","disabled":false,"gated":false,"lastModified":"2025-04-19T16:49:38.000Z","likes":12,"trendingScore":1,"private":false,"sha":"b02edcef1330480be3a11bd6f7434ac32f05ad08","description":"\n\t\n\t\t\n\t\tLicense\n\t\n\nThe dataset is available under the Apache 2.0 license.\n\n\t\n\t\t\n\t\tCitation\n\t\n\nIf you use the VoiceBench dataset in your research, please cite the following paper:\n@article{chen2024voicebench,\n  title={VoiceBench: Benchmarking LLM-Based Voice Assistants},\n  author={Chen, Yiming and Yue, Xianghu and Zhang, Chen and Gao, Xiaoxue and Tan, Robby T. 
and Li, Haizhou},\n  journal={arXiv preprint arXiv:2410.17196},\n  year={2024}\n}\n\n","downloads":3038,"tags":["language:en","license:apache-2.0","size_categories:10K<n<100K","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2410.17196","region:us"],"createdAt":"2024-10-22T06:45:43.000Z","key":""},{"_id":"6718bd6ddf60ea2270dba427","id":"sarvamai/mmlu-indic","author":"sarvamai","disabled":false,"gated":false,"lastModified":"2025-05-23T09:13:30.000Z","likes":13,"trendingScore":1,"private":false,"sha":"cb8b47d03c971b20b7b43a051f6bf1d54ae801e8","description":"\n\t\n\t\t\n\t\tIndic MMLU Dataset\n\t\n\nA multilingual version of the Massive Multitask Language Understanding (MMLU) benchmark, translated from English into 10 Indian languages.\nThis version contains the translations of the development and test sets only. \n\n\t\n\t\t\n\t\tLanguages Covered\n\t\n\nThe dataset includes translations in the following languages:\n\nBengali (bn)\nGujarati (gu)\nHindi (hi)\nKannada (kn)\nMarathi (mr)\nMalayalam (ml)\nOriya (or)\nPunjabi (pa)\nTamil (ta)\nTelugu (te)\n\n\n\t\n\t\t\n\t\tTask Format\n\t\n\nEach… See the full description on the dataset page: 
https://huggingface.co/datasets/sarvamai/mmlu-indic.","downloads":228,"tags":["task_categories:question-answering","language:bn","language:en","language:gu","language:hi","language:kn","language:ml","language:mr","language:or","language:pa","language:ta","language:te","license:mit","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-10-23T09:10:05.000Z","key":""},{"_id":"671928371e52d113736171a4","id":"ClimatePolicyRadar/all-document-text-data","author":"ClimatePolicyRadar","disabled":false,"gated":"auto","lastModified":"2025-10-29T22:12:59.000Z","likes":22,"trendingScore":1,"private":false,"sha":"d4542f2fbc906705e7313f8f01992bfade7b16b7","description":"\n\t\n\t\t\n\t\tClimate Policy Radar Open Data\n\t\n\nThis repo contains the full text data of all of the documents from the Climate Policy Radar database (CPR), which is also available at Climate Change Laws of the World (CCLW). 
\nPlease note that this replaces the Global Stocktake open dataset: that data, including all NDCs and IPCC reports is now a subset of this dataset.\n\n\t\n\t\t\n\t\n\t\n\t\tWhat’s in this dataset\n\t\n\nThis dataset contains two corpus types (groups of the same types or sources of documents) which… See the full description on the dataset page: https://huggingface.co/datasets/ClimatePolicyRadar/all-document-text-data.","downloads":237,"tags":["license:cc-by-4.0","size_categories:10M<n<100M","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","doi:10.57967/hf/5426","region:us"],"createdAt":"2024-10-23T16:45:43.000Z","key":""},{"_id":"671b211f5f7cde5ac313dfa0","id":"Tejasva-Maurya/English-Technical-Speech-Dataset","author":"Tejasva-Maurya","disabled":false,"gated":"auto","lastModified":"2024-10-26T21:18:15.000Z","likes":8,"trendingScore":1,"private":false,"sha":"9e5b4e1b3e877ab5661000d4dbd2644671e40b24","description":"\n\t\n\t\t\n\t\tEnglish Technical Speech Dataset\n\t\n\n\n\t\n\t\t\n\t\tOverview\n\t\n\nThe English Technical Speech Dataset is a curated collection of English technical vocabulary recordings, designed for applications like Text-to-Speech (TTS), Automatic Speech Recognition (ASR), and Audio Classification. 
The dataset includes 11,247 entries and provides audio files, transcriptions, and speaker embeddings to support the development of robust technical language models.\n\nLanguage: English (technical focus)\nTotal… See the full description on the dataset page: https://huggingface.co/datasets/Tejasva-Maurya/English-Technical-Speech-Dataset.","downloads":22,"tags":["size_categories:10K<n<100K","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-10-25T04:39:59.000Z","key":""},{"_id":"671d41db9a9fe27faeb7229f","id":"JacobLinCool/VoiceBank-DEMAND-16k","author":"JacobLinCool","disabled":false,"gated":false,"lastModified":"2024-10-26T19:39:08.000Z","likes":12,"trendingScore":1,"private":false,"sha":"4497db342d7312978c45690591fda86117831940","downloads":1466,"paperswithcode_id":"demand","tags":["task_categories:audio-to-audio","language:en","license:cc-by-4.0","size_categories:10K<n<100K","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-10-26T19:24:11.000Z","key":""},{"_id":"6720807a5ec8b84bfad21c3a","id":"apoidea/fintabnet-html","author":"apoidea","disabled":false,"gated":false,"lastModified":"2024-11-04T08:40:46.000Z","likes":8,"trendingScore":1,"private":false,"sha":"fefe6a2aa1228734ceebe72d0f868bf7b26310ae","downloads":367,"tags":["size_categories:10K<n<100K","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-10-29T06:28:10.000Z","key":""},{"_id":"67209d2ddc8f614649295e12","id":"BangumiBase/orewasubetewoparrysurugyakukanchigainosekaisaikyouwaboukenshaninaritai","author":"BangumiBase","disabled":false,"gated":false,"lastModified":"2024-10-29T11:36:17.000Z","likes":1,"trendingScore":1,"private":false,"sha":"9787c820ebe7447693149feb057e4caee5fbe1e5","description":"\n\t\n\
t\t\n\t\tBangumi Image Base of Ore Wa Subete Wo \"parry\" Suru: Gyaku Kanchigai No Sekai Saikyou Wa Boukensha Ni Naritai\n\t\n\nThis is the image base of bangumi Ore wa Subete wo \"Parry\" suru: Gyaku Kanchigai no Sekai Saikyou wa Boukensha ni Naritai, we detected 55 characters, 4726 images in total. The full dataset is here.\nPlease note that these image bases are not guaranteed to be 100% cleaned, they may be noisy actual. If you intend to manually train models using this dataset, we recommend… See the full description on the dataset page: https://huggingface.co/datasets/BangumiBase/orewasubetewoparrysurugyakukanchigainosekaisaikyouwaboukenshaninaritai.","downloads":1056,"tags":["license:mit","size_categories:1K<n<10K","modality:image","modality:text","region:us","art"],"createdAt":"2024-10-29T08:30:37.000Z","key":""},{"_id":"6720adc25ec8b84bfae2a63c","id":"TucanoBR/wikipedia-PT","author":"TucanoBR","disabled":false,"gated":false,"lastModified":"2024-11-02T00:44:35.000Z","likes":5,"trendingScore":1,"private":false,"sha":"81aeee176c2cfe9b37b2d39eaba4c04d24dfa60c","description":"\n\t\n\t\t\n\t\tWikipedia-PT\n\t\n\n\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThe Portuguese portion of the Wikipedia dataset.\n\n\t\n\t\t\n\t\tSupported Tasks and Leaderboards\n\t\n\nThe dataset is generally used for Language Modeling.\n\n\t\n\t\t\n\t\tLanguages\n\t\n\nPortuguese\n\n\t\n\t\t\n\t\tDataset Structure\n\t\n\n\n\t\n\t\t\n\t\tData Instances\n\t\n\nAn example looks as follows:\n{\n 'text': 'Abril é o quarto mês...'\n}\n\n\n\t\n\t\t\n\t\tData Fields\n\t\n\n\ntext (str): Text content of the article.\n\n\n\t\n\t\t\n\t\tData Splits\n\t\n\nAll configurations contain a single train split.… See the full description on the dataset page: 
https://huggingface.co/datasets/TucanoBR/wikipedia-PT.","downloads":223,"tags":["task_categories:text-generation","language:pt","license:cc-by-sa-3.0","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","portuguese"],"createdAt":"2024-10-29T09:41:22.000Z","key":""},{"_id":"6721bc70beef47423f707cbb","id":"zwhe99/amc23","author":"zwhe99","disabled":false,"gated":false,"lastModified":"2024-10-30T04:56:20.000Z","likes":2,"trendingScore":1,"private":false,"sha":"f9810c0439cd3c670ec885d328a2f06a87f3694a","downloads":4798,"tags":["size_categories:n<1K","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-10-30T04:56:16.000Z","key":""},{"_id":"672953fe02904d477859ddc1","id":"moondream/synthetic-gauges-v3","author":"moondream","disabled":false,"gated":false,"lastModified":"2024-11-04T23:24:58.000Z","likes":1,"trendingScore":1,"private":false,"sha":"daeb50c4afa094ccb71d41733831910fc139dc34","downloads":8,"tags":["size_categories:100K<n<1M","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-11-04T23:08:46.000Z","key":""},{"_id":"672d8bf4bde669ec7e63ba72","id":"allenai/tulu-3-sft-mixture","author":"allenai","disabled":false,"gated":false,"lastModified":"2024-12-02T19:48:33.000Z","likes":252,"trendingScore":1,"private":false,"sha":"b14afda60f1bbebe55d5d2fa1e4df5042f97f8be","description":"\n\n\n\t\n\t\t\n\t\tTulu 3 SFT Mixture\n\t\n\nNote that this collection is licensed under ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. 
We present the mixture as a research artifact.\nThe Tulu 3 SFT mixture was used to train the Tulu 3 series of models.\nIt contains 939,344 samples from the following sets:\n\nCoCoNot (ODC-BY-1.0), 10,983 prompts (Brahman et al., 2024)\nFLAN v2 via ai2-adapt-dev/flan_v2_converted, 89,982 prompts (Longpre et… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-sft-mixture.","downloads":17738,"tags":["task_categories:other","annotations_creators:crowdsourced","annotations_creators:expert-generated","annotations_creators:machine-generated","multilinguality:multilingual","source_datasets:allenai/coconot","source_datasets:ai2-adapt-dev/flan_v2_converted","source_datasets:HuggingFaceH4/no_robots","source_datasets:OpenAssistant/oasst1","source_datasets:allenai/tulu-3-personas-math","source_datasets:allenai/tulu-3-sft-personas-math-grade","source_datasets:allenai/tulu-3-sft-personas-code","source_datasets:allenai/tulu-3-personas-algebra","source_datasets:allenai/tulu-3-sft-personas-instruction-following","source_datasets:AI-MO/NuminaMath-TIR","source_datasets:allenai/wildguardmix","source_datasets:allenai/wildjailbreak","source_datasets:allenai/tulu-3-hard-coded","source_datasets:CohereForAI/aya_dataset","source_datasets:allenai/WildChat-1M","source_datasets:LipengCS/Table-GPT","source_datasets:allenai/SciRIFF","source_datasets:theblackcat102/evol-codealpaca-v1","language:amh","language:arb","language:ary","language:ars","language:acq","language:arz","language:apc","language:ben","language:ceb","language:dan","language:deu","language:ell","language:eng","language:eus","language:fil","language:fin","language:fra","language:gle","language:guj","language:hat","language:hau","language:hin","language:hun","language:ibo","language:ind","language:ita","language:jav","language:jpn","language:kan","language:kir","language:kor","language:kur","language:lit","language:mal","language:mar","language:mlg","language:msa","language:mya","language:nep
","language:nld","language:nso","language:nya","language:pan","language:pes","language:pol","language:por","language:pus","language:rus","language:sin","language:sna","language:snd","language:som","language:spa","language:sqi","language:srp","language:sun","language:swa","language:swe","language:tam","language:tel","language:tha","language:tur","language:ukr","language:urd","language:vie","language:wol","language:xho","language:yor","language:zho","language:zul","license:odc-by","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us"],"createdAt":"2024-11-08T03:56:36.000Z","key":""},{"_id":"672e43b562371d59e7202334","id":"OpenCoder-LLM/opc-sft-stage1","author":"OpenCoder-LLM","disabled":false,"gated":false,"lastModified":"2024-11-24T06:40:44.000Z","likes":75,"trendingScore":1,"private":false,"sha":"1bcab575f5e2d1c1fd6652720418524c27b3d58b","description":"\n\n\t\n\t\t\n\t\tOpenCoder Dataset\n\t\n\nThe OpenCoder dataset is composed of the following datasets:\n\nopc-sft-stage1: the sft data used for opencoder sft-stage1 <-- you are here\nopc-sft-stage2: the sft data used for opencoder sft-stage2\nopc-annealing-corpus: the synthetic data & algorithmic corpus used for opencoder annealing\nopc-fineweb-code-corpus: the code-related page recalled from fineweb\nopc-fineweb-math-corpus: the math-related page recalled from finewebrefineCode-code-corpus-meta: the meta-data… See the full description on the dataset page: 
https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage1.","downloads":2365,"tags":["license:mit","size_categories:1M<n<10M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2411.04905","region:us"],"createdAt":"2024-11-08T17:00:37.000Z","key":""},{"_id":"672f9b7ed7f4171f3751ffb3","id":"OpenCoder-LLM/opc-fineweb-code-corpus","author":"OpenCoder-LLM","disabled":false,"gated":false,"lastModified":"2024-11-24T06:41:46.000Z","likes":54,"trendingScore":1,"private":false,"sha":"9e8e48e666c226294d6f9e6c2e13f2c84c1c06f3","description":"\n\n\t\n\t\t\n\t\tOpenCoder Dataset\n\t\n\nThe OpenCoder dataset is composed of the following datasets:\n\nopc-sft-stage1: the sft data used for opencoder sft-stage1\nopc-sft-stage2: the sft data used for opencoder sft-stage2\nopc-annealing-corpus: the synthetic data & algorithmic corpus used for opencoder annealing\nopc-fineweb-code-corpus: the code-related page recalled from fineweb <-- you are here\nopc-fineweb-math-corpus: the math-related page recalled from finewebrefineCode-code-corpus-meta: the meta-data… See the full description on the dataset page: https://huggingface.co/datasets/OpenCoder-LLM/opc-fineweb-code-corpus.","downloads":3748,"tags":["license:mit","size_categories:100M<n<1B","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2411.04905","region:us"],"createdAt":"2024-11-09T17:27:26.000Z","key":""},{"_id":"6730c462da4b02fafbb06674","id":"hal-utokyo/Manga109-s","author":"hal-utokyo","disabled":false,"gated":"auto","lastModified":"2026-06-04T05:45:10.000Z","likes":33,"trendingScore":1,"private":false,"sha":"f276b45293be4f4ca92f313a638d26dbcb50fbc2","description":"\n\t\n\t\t\n\t\n\t\n\t\tManga109 Homepage\n\t\n\nhttps://manga109.github.io/manga109-project-website \n\n\t\n\t\t\n\nAcademic Use\nNon-academic Use\n\n\n\t\t\nAcademic Institutions (e.g. 
University)\nManga109\nManga109-s\n\n\nOthers (e.g. Company)\nManga109-s\nManga109-s\n\n\n\t\n\n\n\t\n\t\t\n\t\n\t\n\t\t⚠️ Important Notice ⚠️\n\t\n\nStarting from May 21, 2026, we distribute Manga109-s-v2026, an updated version of the dataset in which the annotations were revised and improved in 2026, following the methodology this paper.\nTo unify image sizes across… See the full description on the dataset page: https://huggingface.co/datasets/hal-utokyo/Manga109-s.","downloads":136,"tags":["language:ja","license:other","arxiv:2605.21182","region:us"],"createdAt":"2024-11-10T14:34:10.000Z","key":""},{"_id":"6732589af21b5584072d297b","id":"gretelai/gretel-text-to-python-fintech-en-v1","author":"gretelai","disabled":false,"gated":false,"lastModified":"2024-11-11T20:13:57.000Z","likes":18,"trendingScore":1,"private":false,"sha":"a126172a7ed9017932fecc9aad04ffd6f2bdb9cf","description":"\n\t\n\t\t\n\t\tGretel Synthetic Text-to-Python Dataset for FinTech\n\t\n\nThis dataset is a synthetically generated collection of natural language prompts paired with their corresponding Python code snippets, specifically tailored for the FinTech industry. 
\nCreated using Gretel Navigator's Data Designer, with mistral-nemo-2407 and Qwen/Qwen2.5-Coder-7B as the backend models, it aims to bridge the gap between natural language inputs and high-quality Python code, empowering professionals to implement… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/gretel-text-to-python-fintech-en-v1.","downloads":85,"tags":["task_categories:text-generation","language:en","license:apache-2.0","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","synthetic","domain-specific","text-to-code","fintech"],"createdAt":"2024-11-11T19:18:50.000Z","key":""},{"_id":"67333902d741f752b8483d42","id":"OpenCoder-LLM/opc-annealing-corpus","author":"OpenCoder-LLM","disabled":false,"gated":false,"lastModified":"2025-05-29T20:09:06.000Z","likes":44,"trendingScore":1,"private":false,"sha":"078e4bc1d39725cccc18514d2e3278a3fe191110","description":"\n\n\t\n\t\t\n\t\tOpenCoder Dataset\n\t\n\nThe OpenCoder dataset is composed of the following datasets:\n\nopc-sft-stage1: the sft data used for opencoder sft-stage1\nopc-sft-stage2: the sft data used for opencoder sft-stage2\nopc-annealing-corpus: the synthetic data & algorithmic corpus used for opencoder annealing <-- you are here\nfineweb-code-corpus: the code-related page recalled from fineweb\nfineweb-math-corpus: the math-related page recalled from finewebrefineCode-code-corpus-meta: the meta-data of… See the full description on the dataset page: 
https://huggingface.co/datasets/OpenCoder-LLM/opc-annealing-corpus.","downloads":1234,"tags":["license:odc-by","size_categories:10M<n<100M","format:json","modality:text","library:datasets","library:dask","library:mlcroissant","arxiv:2411.04905","region:us"],"createdAt":"2024-11-12T11:16:18.000Z","key":""},{"_id":"67335bb8f014ee49558ef3fe","id":"PleIAs/common_corpus","author":"PleIAs","disabled":false,"gated":false,"lastModified":"2026-05-06T00:28:17.000Z","likes":404,"trendingScore":1,"private":false,"sha":"307910e4c5d040d6f318e6edf2a2b97849155771","description":"\n\t\n\t\t\n\t\tCommon Corpus\n\t\n\n\n  Full paper - ICLR 2026 oral\n\n\nCommon Corpus is the largest open licensed text dataset, comprising 2.27 trillion tokens (2,267,302,720,836 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus has been created by Pleias in association with several partners.\nCommon Corpus differs from existing open datasets in that it is:\n\nTruly Open: contains only data that is either… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.","downloads":91634,"tags":["language:en","language:fr","language:de","language:zh","language:it","language:es","language:ja","language:pl","language:la","language:nl","language:ru","language:ar","language:ko","size_categories:10K<n<100K","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","arxiv:2410.22587","region:us"],"createdAt":"2024-11-12T13:44:24.000Z","key":""},{"_id":"6734e9a8505981a794b3b459","id":"llm-jp/magpie-sft-v1.0","author":"llm-jp","disabled":false,"gated":false,"lastModified":"2024-11-13T18:54:02.000Z","likes":18,"trendingScore":1,"private":false,"sha":"4f949e18cfa01d99abe6f92e5629157d69737e6d","description":"\n\t\n\t\t\n\t\tmagpie-sft-v1.0\n\t\n\nThis repository provides an instruction-tuning dataset developed by 
LLM-jp, a collaborative project launched in Japan.\nThis is a dataset of instruction and response pairs created using the Magpie method.\ncyberagent/calm3-22b-chat was used for generating the instructions, and Qwen/Qwen2.5-32B-Instruct was used for generating the responses.\n\n\t\n\t\t\n\t\n\t\n\t\tSend Questions to\n\t\n\nllm-jp(at)nii.ac.jp\n\n\t\n\t\t\n\t\n\t\n\t\tModel Card Authors\n\t\n\nThe names are listed in alphabetical order.… See the full description on the dataset page: https://huggingface.co/datasets/llm-jp/magpie-sft-v1.0.","downloads":207,"tags":["task_categories:text-generation","language:ja","license:apache-2.0","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2406.08464","region:us"],"createdAt":"2024-11-13T18:02:16.000Z","key":""},{"_id":"6736f04b00df53329c422905","id":"OpenCoder-LLM/RefineCode-code-corpus-meta","author":"OpenCoder-LLM","disabled":false,"gated":false,"lastModified":"2024-11-15T15:58:07.000Z","likes":26,"trendingScore":1,"private":false,"sha":"017558900665343ca563733212623e37606167ab","description":"This dataset consists of meta information (including the repository name and file path) of the raw code data from RefineCode. You can collect those files referring to this metadata and reproduce RefineCode!\nNote: Currently, we have uploaded the meta data covered by The Stack V2 (About 50% file volume). Due to complex legal considerations, we are unable to provide the complete source code currently. 
We are working hard to make the remaining part available.\n\nRefineCode is a high-quality… See the full description on the dataset page: https://huggingface.co/datasets/OpenCoder-LLM/RefineCode-code-corpus-meta.","downloads":1040,"tags":["license:mit","size_categories:100M<n<1B","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-11-15T06:55:07.000Z","key":""},{"_id":"67374c18c32c765810f748f6","id":"HuggingFaceH4/MATH-500","author":"HuggingFaceH4","disabled":false,"gated":false,"lastModified":"2025-12-15T11:01:40.000Z","likes":317,"trendingScore":1,"private":false,"sha":"6e4ed1a2a79af7d8630a6b768ec859cb5af4d3be","description":"\n\t\n\t\t\n\t\n\t\n\t\tDataset Card for MATH-500\n\t\n\n\n\nThis dataset contains a subset of 500 problems from the MATH benchmark that OpenAI created in their Let's Verify Step by Step paper. See their GitHub repo for the source file: https://github.com/openai/prm800k/tree/main?tab=readme-ov-file#math-splits\n","downloads":127258,"tags":["task_categories:text-generation","language:en","size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2024-11-15T13:26:48.000Z","key":""},{"_id":"6738b65d6c12c4b98b2dfda0","id":"Josephgflowers/Par-Four-Fineweb-Edu-Fortified-Math","author":"Josephgflowers","disabled":false,"gated":false,"lastModified":"2024-11-16T16:18:45.000Z","likes":1,"trendingScore":1,"private":false,"sha":"42beaaf8229a3f26bd17a2d9cb80ee2ff869348d","description":"Filtered for a math focus.\nScript used to create the 
dataset:\nhttps://huggingface.co/datasets/Josephgflowers/Par-Four-Fineweb-Edu-Fortified-Math/resolve/main/find-math-fine.py\n","downloads":11,"tags":["license:odc-by","size_categories:100K<n<1M","format:csv","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-11-16T15:12:29.000Z","key":""},{"_id":"67390bdfd566e6d2a8f9eb12","id":"jrobador/mathinstruct_es","author":"jrobador","disabled":false,"gated":false,"lastModified":"2024-11-16T21:17:59.000Z","likes":1,"trendingScore":1,"private":false,"sha":"16a38ca372eb188ff06663f522001da1ed1ee8e6","downloads":27,"tags":["license:mit","size_categories:100K<n<1M","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-11-16T21:17:19.000Z","key":""},{"_id":"673a1149a7a311f5bed5c624","id":"HuggingFaceTB/smoltalk","author":"HuggingFaceTB","disabled":false,"gated":false,"lastModified":"2025-02-10T16:36:16.000Z","likes":415,"trendingScore":1,"private":false,"sha":"5feaf2fd3ffca7c237fc38d1861bc30365d48ffa","description":"\n\t\n\t\t\n\t\tSmolTalk\n\t\n\n\n\n\t\n\t\t\n\t\tDataset description\n\t\n\nThis is a synthetic dataset designed for supervised finetuning (SFT) of LLMs. It was used to build SmolLM2-Instruct family of models and contains 1M samples. More details in our paper https://arxiv.org/abs/2502.02737\nDuring the development of SmolLM2, we observed that models finetuned on public SFT datasets underperformed compared to other models with proprietary instruction datasets. 
To address this gap, we created new synthetic datasets… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smoltalk.","downloads":18059,"tags":["language:en","size_categories:1M<n<10M","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2502.02737","region:us","synthetic"],"createdAt":"2024-11-17T15:52:41.000Z","key":""},{"_id":"673d2860d092204f7b8871c2","id":"atlasia/Atlaset","author":"atlasia","disabled":false,"gated":"auto","lastModified":"2025-03-31T23:04:58.000Z","likes":32,"trendingScore":1,"private":false,"sha":"9c62c3328c10a8a6d22b6fe22dacde29d6b1af30","description":"\n\t\n\t\t\n\t\tAtlaset: A Curated Dataset of Moroccan Darija\n\t\n\nThis dataset is a comprehensive, carefully curated collection of text data specifically for Moroccan Darija, the Arabic dialect spoken in Morocco. It combines various sources to provide a diverse and accurate representation of the language. 
\nThis dataset was curated by Abdelaziz Bounhar and is particularly suited for tasks such as:\n\nLearning word embeddings for Moroccan Darija\nTraining NLP models for tasks like language modeling, text… See the full description on the dataset page: https://huggingface.co/datasets/atlasia/Atlaset.","downloads":163,"tags":["language:ary","language:ar","size_categories:1M<n<10M","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-11-20T00:08:00.000Z","key":""},{"_id":"673dcb196045033e62840114","id":"HiTZ/composite_corpus_eu_v2.1","author":"HiTZ","disabled":false,"gated":false,"lastModified":"2024-12-19T16:23:28.000Z","likes":3,"trendingScore":1,"private":false,"sha":"d9018862fe61c3070000a0c5da011cfc33b0f44b","description":"\n\t\n\t\t\n\t\tComposite dataset for Basque made from public available data\n\t\n\nThis dataset is composed of the following public available data:\n\n\t\n\t\t\n\t\tTrain split:\n\t\n\nThe train split is composed of the following datasets combined:\n\nmozilla-foundation/common_voice_18_0/eu: \"validated\" split removing \"test_cv\" and \"dev_cv\" split's sentences. 
(validated split contains official train + dev + test splits and more unique data)\ngttsehu/basque_parliament_1/eu: \"train_clean\" split removing some of the… See the full description on the dataset page: https://huggingface.co/datasets/HiTZ/composite_corpus_eu_v2.1.","downloads":760,"tags":["task_categories:automatic-speech-recognition","language:eu","license:cc-by-4.0","size_categories:100K<n<1M","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","asr","stt","dataset"],"createdAt":"2024-11-20T11:42:17.000Z","key":""},{"_id":"673e5d00b82207ded38c2624","id":"nroggendorff/dictionary","author":"nroggendorff","disabled":false,"gated":false,"lastModified":"2024-11-20T23:28:36.000Z","likes":10,"trendingScore":1,"private":false,"sha":"aed3e55ad4805b0a003bd98bddcedc99eb618670","downloads":33,"tags":["size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-11-20T22:04:48.000Z","key":""},{"_id":"673f18f2a7cf3894a6ae4721","id":"AI-MO/olympiads","author":"AI-MO","disabled":false,"gated":false,"lastModified":"2025-11-06T02:36:13.000Z","likes":6,"trendingScore":1,"private":false,"sha":"5486690cc16bc362d45b6db75bdd5c7a709ca574","description":"\n\t\n\t\t\n\t\tAI-MO Olympiad Reference Dataset\n\t\n\nThis dataset contains a structured collection of Olympiad problems and their solutions, \norganized by competition. 
Contains high quality data, prioritizing \"official\" solutions to problems.\n\n\t\n\t\t\n\t\tStructure\n\t\n\n<competition name>/    # Problems and solutions from the International Mathematical Olympiad\n├── raw/               # Raw problem/solution statements (.pdf)\n│   ├── file1.pdf\n│   ├── file2.pdf\n├── download_script/   # the scripts used to… See the full description on the dataset page: https://huggingface.co/datasets/AI-MO/olympiads.","downloads":36152,"tags":["modality:document","region:us"],"createdAt":"2024-11-21T11:26:42.000Z","key":""},{"_id":"673f249163f0f2627e672a71","id":"unsloth/LaTeX_OCR","author":"unsloth","disabled":false,"gated":false,"lastModified":"2024-11-21T12:27:50.000Z","likes":83,"trendingScore":1,"private":false,"sha":"4da395c8a0253f4f30983cf08f2480f9bafbd561","description":"1% sampled from https://huggingface.co/datasets/linxy/LaTeX_OCR\n","downloads":4493,"tags":["license:apache-2.0","size_categories:10K<n<100K","format:parquet","modality:image","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-11-21T12:16:17.000Z","key":""},{"_id":"673f3bf0039b1ab3a8155e23","id":"amaai-lab/JamendoMaxCaps","author":"amaai-lab","disabled":false,"gated":false,"lastModified":"2026-05-25T13:16:57.000Z","likes":24,"trendingScore":1,"private":false,"sha":"13090d785d071dcd617636deaa4a9775936ac995","description":"\n\t\n\t\t\n\t\tJamendoMaxCaps Dataset\n\t\n\n\nJamendoMaxCaps is a large-scale dataset of over 362,000 instrumental tracks sourced from the Jamendo platform. It includes generated music captions and original metadata. Additionally, we introduce a retrieval system that utilizes both musical features and metadata to identify similar songs, which are then used to impute missing metadata via a local large language model (LLLM). 
\nThis dataset facilitates research in:\n\nMusic-language understanding\nMusic… See the full description on the dataset page: https://huggingface.co/datasets/amaai-lab/JamendoMaxCaps.","downloads":2988,"tags":["license:cc-by-sa-3.0","size_categories:100K<n<1M","format:parquet","modality:audio","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2502.07461","region:us"],"createdAt":"2024-11-21T13:56:00.000Z","key":""},{"_id":"67404e9bfc1b0f0d9226f1d7","id":"X779/Danbooruwildcards","author":"X779","disabled":false,"gated":false,"lastModified":"2024-11-23T17:04:47.000Z","likes":13,"trendingScore":1,"private":false,"sha":"989968c0c8a28d023523de6b577a7699422a798f","description":"This is a set of wildcards for danbooru tags.\nArtist：Prompts for random artist styles, covering approximately 0.6M different artists.Please select the appropriate version of the collection, ranging from 128 to 5000, based on the model's capabilities.The full version is not recommended for use as it includes too many artists with only one image on danbooru or other websites. 
Almost no model can generate a style that corresponds to these artists .\nCharacters：\"Characters\" is a set of wildcards… See the full description on the dataset page: https://huggingface.co/datasets/X779/Danbooruwildcards.","downloads":45971,"tags":["task_categories:text-generation","language:en","license:other","size_categories:10M<n<100M","format:text","modality:text","library:datasets","library:mlcroissant","region:us"],"createdAt":"2024-11-22T09:27:55.000Z","key":""},{"_id":"674572cb87fb2fed655028ac","id":"ivrit-ai/crowd-transcribe-v5","author":"ivrit-ai","disabled":false,"gated":"auto","lastModified":"2025-11-19T10:05:45.000Z","likes":12,"trendingScore":1,"private":false,"sha":"6059c32f252bd79058975ffc3323d8c53d7427dc","description":"\n\t\n\t\t\n\t\tLicense\n\t\n\nThe dataset is released under the ivrit.ai License, which enables broad research and commercial use.\n\nFull license: https://www.ivrit.ai/en/the-license/\nFAQs: https://www.ivrit.ai/en/license-faqs/\n\n","downloads":498,"tags":["license:other","size_categories:100K<n<1M","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-11-26T07:03:39.000Z","key":""},{"_id":"6745b589c17df8407a2118cd","id":"Abhishekcr448/Hinglish-Everyday-Conversations-1M","author":"Abhishekcr448","disabled":false,"gated":false,"lastModified":"2024-11-26T14:18:02.000Z","likes":18,"trendingScore":1,"private":false,"sha":"cf040b1b80857c6038eb73ddad441f0b435e02a6","description":"\n\t\n\t\t\n\t\tDataset Card for Hinglish Everyday Conversations Dataset\n\t\n\nA synthetically created Hinglish-based dataset of 2 columns where every row represents a unique conversation between 2 people in Hinglish about Everyday Life Topics.\n\n\t\n\t\t\n\t\tUse Model\n\t\n\nAccess the model made using this dataset: Tiny-Hinglish-Chat-21M\nFor more information about this model, its training process, or related resources, you can check the GitHub 
repository Tiny-Hinglish-Chat-21M-Scripts.\n\n\t\n\t\t\n\t\tDataset Details… See the full description on the dataset page: https://huggingface.co/datasets/Abhishekcr448/Hinglish-Everyday-Conversations-1M.","downloads":727,"tags":["task_categories:text-generation","language:en","license:mit","size_categories:1M<n<10M","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","Hinglish","Everyday-Conversations"],"createdAt":"2024-11-26T11:48:25.000Z","key":""},{"_id":"674745ee7bac5607185c1485","id":"allenai/pixmo-cap","author":"allenai","disabled":false,"gated":false,"lastModified":"2024-11-27T22:44:36.000Z","likes":42,"trendingScore":1,"private":false,"sha":"edce6390d9d5be6c8db0d863fbe62718c88988a4","description":"\n\t\n\t\t\n\t\tPixMo-Cap\n\t\n\nPixMo-Cap is a dataset of very long (roughly 200 words on average), detailed captions.\nIt can be used to pre-train and fine-tune vision-language models. \nPixMo-Cap was created by recording annotators speaking about an image for 60-90 seconds and then using the Claude large language model to turn the audio transcripts(s) into a long caption. 
\nThe audio transcripts are also included.\nPixMo-Cap is part of the PixMo dataset collection and was used to train the Molmo family of… See the full description on the dataset page: https://huggingface.co/datasets/allenai/pixmo-cap.","downloads":678,"tags":["task_categories:image-to-text","license:odc-by","size_categories:100K<n<1M","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-11-27T16:16:46.000Z","key":""},{"_id":"6747467026fe781354769700","id":"allenai/pixmo-points","author":"allenai","disabled":false,"gated":false,"lastModified":"2024-11-27T22:12:24.000Z","likes":46,"trendingScore":1,"private":false,"sha":"2b5c6931e790e00ae00d4a2857e5f95d88f09a66","description":"\n\t\n\t\t\n\t\tPixMo-Points\n\t\n\nPixMo-Points is a dataset of images paired with referring expressions and points marking the locations the\nreferring expression refers to in the image. It was collected using human annotators and contains a diverse \nrange of points and expressions, with many high-frequency (10+) expressions.\nPixMo-Points is a part of the PixMo dataset collection and was used to \nprovide the pointing capabilities of the Molmo family of models\nQuick links:\n\n📃 Paper\n🎥 Blog with Videos… See the full description on the dataset page: 
https://huggingface.co/datasets/allenai/pixmo-points.","downloads":848,"tags":["license:odc-by","size_categories:1M<n<10M","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-11-27T16:18:56.000Z","key":""},{"_id":"67474ca29027864254acb6b8","id":"allenai/pixmo-point-explanations","author":"allenai","disabled":false,"gated":false,"lastModified":"2024-12-05T18:45:24.000Z","likes":12,"trendingScore":1,"private":false,"sha":"08a566fa00747e4c1c7e8481c350763b469c209c","description":"\n\t\n\t\t\n\t\tPixMo-Point-Explanations\n\t\n\nPixMo-Point-Explanations is a dataset of images, questions, and answers with explanations that can include in-line points that refer to parts of the image.\nIt can be used to train vison language models to respond to questions through a mixture of text and points.\nPixMo-Point-Explanations is part of the PixMo dataset collection and was used to train the Molmo family of models\nWe consider this dataset experimental, while these explanations can be very… See the full description on the dataset page: https://huggingface.co/datasets/allenai/pixmo-point-explanations.","downloads":114,"tags":["task_categories:visual-question-answering","language:en","license:odc-by","size_categories:10K<n<100K","format:parquet","modality:image","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-11-27T16:45:22.000Z","key":""},{"_id":"6747510080aebd7436aa497d","id":"alpindale/two-million-bluesky-posts","author":"alpindale","disabled":false,"gated":false,"lastModified":"2024-11-28T06:38:41.000Z","likes":203,"trendingScore":1,"private":false,"sha":"ba52d3e2f7f56701f07dca1c7eb474ab4b8aa1b2","description":"\n\t\n\t\t\n\t\t2 Million Bluesky Posts\n\t\n\nThis dataset contains 2 million public posts collected from Bluesky Social's firehose API, intended for machine learning research and experimentation with 
social media data.\nThe with-language-predictions config contains the same data as the default config but with language predictions added using the glotlid model.\nDataset Details\nDataset Description\nThis dataset consists of 2 million public posts from Bluesky Social, collected through the platform's firehose… See the full description on the dataset page: https://huggingface.co/datasets/alpindale/two-million-bluesky-posts.","downloads":628,"tags":["language:en","license:apache-2.0","size_categories:1M<n<10M","format:json","modality:text","library:datasets","library:dask","library:mlcroissant","region:us","bluesky"],"createdAt":"2024-11-27T17:04:00.000Z","key":""},{"_id":"67476118261fd5af601c7f55","id":"Lipsrow/ruwiki_cleaned","author":"Lipsrow","disabled":false,"gated":false,"lastModified":"2024-11-29T12:33:52.000Z","likes":2,"trendingScore":1,"private":false,"sha":"3d44f228418fb8e225065f40737d0270c2d22808","downloads":24,"tags":["license:mit","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-11-27T18:12:40.000Z","key":""},{"_id":"67480a6432d7dafb412b1ff5","id":"BoKelvin/GEMeX-VQA","author":"BoKelvin","disabled":false,"gated":false,"lastModified":"2024-12-01T14:11:28.000Z","likes":17,"trendingScore":1,"private":false,"sha":"9406e417a3a27ea434f9de512d93fd10119636bb","downloads":118,"tags":["region:us"],"createdAt":"2024-11-28T06:15:00.000Z","key":""},{"_id":"674c8f347bf5e3e90b6a237b","id":"THUMedInfo/RareArena","author":"THUMedInfo","disabled":false,"gated":false,"lastModified":"2026-02-26T06:22:56.000Z","likes":6,"trendingScore":1,"private":false,"sha":"f551879efbf6624306bd53a20353b357b4d469b2","description":"\n\t\n\t\t\n\t\tRareArena\n\t\n\nA Comprehensive Rare Disease Diagnostic Dataset with nearly 50,000 patients covering more than 4000 diseases.\nFor the reproduction and evaluation script, see our Github.\n\n\t\n\t\t\n\t\tData Collection\n\t\n\nWe 
build our work upon PMC-Patients, a large-scale patient summary dataset sourced from PMC case reports, and we use GPT-4o for all data processing.\nTo be specific, we first filter cases focusing on rare disease diagnoses from PMC-Patients, and extract their ground-truth… See the full description on the dataset page: https://huggingface.co/datasets/THUMedInfo/RareArena.","downloads":89,"tags":["task_categories:question-answering","language:en","license:cc-by-nc-sa-4.0","size_categories:10K<n<100K","region:us","rare_disease","diagnosis"],"createdAt":"2024-12-01T16:30:44.000Z","key":""},{"_id":"674f02c73ea1f06a03f33549","id":"YkiWu/EmbodiedOcc-ScanNet","author":"YkiWu","disabled":false,"gated":false,"lastModified":"2025-08-27T08:14:33.000Z","likes":1,"trendingScore":1,"private":false,"sha":"936b392264c0308640c6284e8e9c24f834e5d2ac","description":"This repository contains the EmbodiedOcc-ScanNet dataset, which is a reorganized benchmark based on local annotations, designed to facilitate the evaluation of the embodied 3D occupancy prediction task. 
It accompanies the paper EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding.\nProject page: https://ykiwu.github.io/EmbodiedOcc/\nCode: https://github.com/YkiWu/EmbodiedOcc\n\n\t\n\t\t\n\t\n\t\n\t\tPaper Abstract\n\t\n\n3D occupancy prediction provides a comprehensive… See the full description on the dataset page: https://huggingface.co/datasets/YkiWu/EmbodiedOcc-ScanNet.","downloads":70,"tags":["task_categories:image-to-3d","license:apache-2.0","arxiv:2412.04380","region:us","3d-occupancy-prediction","robotics","scene-understanding","computer-vision"],"createdAt":"2024-12-03T13:08:23.000Z","key":""},{"_id":"67502e59ba961e12f0b0bec7","id":"MultimodalUniverse/legacysurvey","author":"MultimodalUniverse","disabled":false,"gated":false,"lastModified":"2024-12-04T11:00:35.000Z","likes":6,"trendingScore":1,"private":false,"sha":"112bddb46da031de2807ced8fe28fb79ed13b515","description":"---\ndescription: 'Image dataset from Legacy Survey DR10\n\n  '\nhomepage: https://www.legacysurvey.org/dr10/\nversion: 1.0.0\ncitation: \"% % ACKNOWLEDGEMENTS\\n% Data Release 10 (DR10) is the tenth public data  \\ release of the Legacy Surveys.\\n% \\n% When using data from the Legacy Surveys  \\ in papers, please use the following acknowledgment:\\n% \\n% The Legacy Surveys  \\ consist of three individual and complementary projects: the Dark Energy Camera  \\ Legacy Survey (DECaLS; Proposal ID #2014B-0404;… See the full description on the dataset page: 
https://huggingface.co/datasets/MultimodalUniverse/legacysurvey.","downloads":5291,"tags":["size_categories:10K<n<100K","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:1804.08657","region:us"],"createdAt":"2024-12-04T10:26:33.000Z","key":""},{"_id":"67511c0b9c31de7f912002f9","id":"Maxwell-Jia/AIME_2024","author":"Maxwell-Jia","disabled":false,"gated":false,"lastModified":"2025-02-18T06:39:19.000Z","likes":84,"trendingScore":1,"private":false,"sha":"8d88b2876a82a080e2f172cc9b25d0d9d2cb4792","description":"\n\t\n\t\t\n\t\tAIME 2024 Dataset\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\nThis dataset contains problems from the American Invitational Mathematics Examination (AIME) 2024. AIME is a prestigious high school mathematics competition known for its challenging mathematical problems.\n\n\t\n\t\t\n\t\tDataset Details\n\t\n\n\nFormat: JSONL\nSize: 30 records\nSource: AIME 2024 I & II\nLanguage: English\n\n\n\t\n\t\t\n\t\tData Fields\n\t\n\nEach record contains the following fields:\n\nID: Problem identifier (e.g., \"2024-I-1\" represents Problem 1… See the full description on the dataset page: https://huggingface.co/datasets/Maxwell-Jia/AIME_2024.","downloads":28149,"tags":["task_categories:text-generation","language:en","license:mit","size_categories:n<1K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","explanation-generation"],"createdAt":"2024-12-05T03:20:43.000Z","key":""},{"_id":"67518700ea42b90baa11b4f8","id":"mwirth-epo/cpc-classification-data","author":"mwirth-epo","disabled":false,"gated":false,"lastModified":"2025-05-07T12:38:22.000Z","likes":1,"trendingScore":1,"private":false,"sha":"8c9af4b6807e2b32d8823ca12f1ad75d3d08c13c","description":"\n\t\n\t\t\n\t\tCPC classification datasets\n\t\n\nThese datasets have been used to train the CPC (Cooperative Patent Classification) classification models 
mentioned in the article Hähnke, V. D., Wéry, A., Wirth, M., & Klenner-Bajaja, A. (2025). Encoder models at the European Patent Office: Pre-training and use cases. World Patent Information, 81, 102360. https://doi.org/10.1016/j.wpi.2025.102360.\nColumns:\n\npublication_number: the patent publication number, the content of the publication can be looked up… See the full description on the dataset page: https://huggingface.co/datasets/mwirth-epo/cpc-classification-data.","downloads":1843,"tags":["task_categories:text-classification","license:cc-by-sa-4.0","size_categories:10M<n<100M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","legal"],"createdAt":"2024-12-05T10:57:04.000Z","key":""},{"_id":"6751deb598a34ccffaa97009","id":"infly/INF-ORM-Preference-Magnitude-80K","author":"infly","disabled":false,"gated":false,"lastModified":"2024-12-05T17:44:28.000Z","likes":10,"trendingScore":1,"private":false,"sha":"7298bedd776a246c14b809c81e1df8c4e57a4df5","downloads":46,"tags":["license:other","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-12-05T17:11:17.000Z","key":""},{"_id":"6755cc99bb1d50918ad5df9f","id":"nebius/SWE-agent-trajectories","author":"nebius","disabled":false,"gated":false,"lastModified":"2024-12-23T12:42:05.000Z","likes":87,"trendingScore":1,"private":false,"sha":"68195a1450865274106246d0d0296a1d6807b88e","description":"\n\t\n\t\t\n\t\tDataset Summary\n\t\n\nThis dataset contains 80,036 trajectories generated by a software engineering agent based on the SWE-agent framework, using various models as action generators. 
In these trajectories, the agent attempts to solve GitHub issues from the nebius/SWE-bench-extra and the dev split of princeton-nlp/SWE-bench.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Description\n\t\n\nThis dataset was created as part of a research project focused on developing a software engineering agent using open-weight models… See the full description on the dataset page: https://huggingface.co/datasets/nebius/SWE-agent-trajectories.","downloads":3020,"tags":["license:cc-by-4.0","size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","code","synthetic","tools","agents","software"],"createdAt":"2024-12-08T16:43:05.000Z","key":""},{"_id":"67566c5b948f4a4064e0ca39","id":"yczhuang/Hephaestus-Forge","author":"yczhuang","disabled":false,"gated":false,"lastModified":"2025-09-08T14:41:29.000Z","likes":2,"trendingScore":1,"private":false,"sha":"92839d13686c9bc3018c660ad25124a6999f93f9","downloads":118,"tags":["license:apache-2.0","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-12-09T04:04:43.000Z","key":""},{"_id":"6756b4471bc6b93608817e56","id":"AdrianUTokyo/ChinaPoliticalDiscourse","author":"AdrianUTokyo","disabled":false,"gated":false,"lastModified":"2024-12-09T09:12:49.000Z","likes":1,"trendingScore":1,"private":false,"sha":"87d3a352f93e21c1d7cecae05a45855090ddf4ed","downloads":24,"tags":["size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-12-09T09:11:35.000Z","key":""},{"_id":"6759d1efc9e617ff0c53ecbe","id":"mlfoundations-dev/stackexchange_linguistics","author":"mlfoundations-dev","disabled":false,"gated":false,"lastModified":"2024-12-23T17:47:23.000Z","likes":1,"trendingScore":1,"private":false,"sha":"682a211b4b03ec2f78981568d8706a124d45bc4c","downloads":8,"tags":
["size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-12-11T17:54:55.000Z","key":""},{"_id":"675bc87d7b7bc33724fad105","id":"tanvirnwu/HazeSpace2M","author":"tanvirnwu","disabled":false,"gated":false,"lastModified":"2025-07-23T08:44:52.000Z","likes":4,"trendingScore":1,"private":false,"sha":"f17151a042535aaae941f535672fdde2a1f15fcf","description":"Published in ACM Multimedia 2024, Melbourne, Australia\nHazeSpace2M: A Dataset for Haze Aware Single Image Dehazing [Paper]\nMd Tanvir Islam 1, Nasir Rahim 1, Saeed Anwar 2, Muhammad Saqib 3, Sambit Bakshi 4, Khan Muhammad 1, *\n| 1. Sungkyunkwan University, South Korea | 2. ANU, Australia | 3. CSIRO, Australia | 4. NIT Rourkela, India || *Corresponding Author | \n\n\n\n\t\n\t\t\n\t\n\t\n\t\tIMPORTANT UPDATES\n\t\n\n\n2025/07/23 | The Satellite version is also uploaded now.\n2025/02/28 | We will try our best to make… See the full description on the dataset page: https://huggingface.co/datasets/tanvirnwu/HazeSpace2M.","downloads":312,"tags":["task_categories:image-classification","task_ids:multi-class-image-classification","language:en","license:cc-by-4.0","size_categories:100B<n<1T","modality:image","region:us","image","dehazing","dehazer","classification","haze","hazy","hazespace2m","removal","enhancement","restoration","image restoration","image enhancement","single image dehazing","multi weather dehazing","dehazing 
dataset"],"createdAt":"2024-12-13T05:39:09.000Z","key":""},{"_id":"675c53502b2bc1c9b592523f","id":"mlfoundations-dev/stackexchange_security","author":"mlfoundations-dev","disabled":false,"gated":false,"lastModified":"2024-12-23T18:37:35.000Z","likes":1,"trendingScore":1,"private":false,"sha":"726c7331ec32071d1fd0f846753a5979d85da1dc","downloads":2116,"tags":["size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-12-13T15:31:28.000Z","key":""},{"_id":"675c63365a0d6eb4529d41fb","id":"baobaoh/13-dimensions-music-emotions","author":"baobaoh","disabled":false,"gated":false,"lastModified":"2024-12-17T15:50:03.000Z","likes":3,"trendingScore":1,"private":false,"sha":"c1bb1886135ccae5213e1d541949a271d215d99b","description":"\n\t\n\t\t\n\t\tDataset Card for Music Emotion Ratings Across Cultures\n\t\n\nThis dataset captures the mean emotional category ratings for 1,841 music samples based on subjective experiences reported by participants from the United States and China. 
The ratings were collected as part of the study to uncover the universal and nuanced emotions evoked by instrumental music.\n\n\n\t\n\t\t\n\t\tDataset Details\n\t\n\n\n\t\n\t\t\n\t\tDataset Sources\n\t\n\n\nPaper: What music makes us feel: At least 13 dimensions organize subjective… See the full description on the dataset page: https://huggingface.co/datasets/baobaoh/13-dimensions-music-emotions.","downloads":117,"tags":["task_categories:feature-extraction","task_categories:audio-classification","size_categories:1K<n<10K","format:parquet","modality:audio","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","Music","Emotion","Recognition","MERT","Dataset","Audio"],"createdAt":"2024-12-13T16:39:18.000Z","key":""},{"_id":"675edb8be4d6d0e820b47f6f","id":"alvanlii/canto-asr-test","author":"alvanlii","disabled":false,"gated":false,"lastModified":"2024-12-15T17:08:52.000Z","likes":1,"trendingScore":1,"private":false,"sha":"9d488878c0651c5b2f950aa1a77320abaca8f5f0","downloads":128,"tags":["size_categories:100K<n<1M","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-12-15T13:37:15.000Z","key":""},{"_id":"675ede261c375f21ffe86ab0","id":"JadenGGGeee/NaVAB","author":"JadenGGGeee","disabled":false,"gated":false,"lastModified":"2024-12-30T04:01:59.000Z","likes":2,"trendingScore":1,"private":false,"sha":"be5dd5a1c8032fef99ad1354cef956bf20d2905f","description":"\n\t\n\t\t\n\t\tDataset Card for NaVAB\n\t\n\n\n\t\n\t\t\n\t\tDataset Details\n\t\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\nNaVAB is a comprehensive benchmark designed to evaluate the alignment of Large Language Models (LLMs) with the values of five major nations: China, the United States, the United Kingdom, France, and Germany. 
The dataset addresses the limitations of existing benchmarks, which often fail to capture the dynamic nature of values across countries and lack sufficient evaluation data.\nThe dataset enables the… See the full description on the dataset page: https://huggingface.co/datasets/JadenGGGeee/NaVAB.","downloads":54,"tags":["language:en","language:zh","language:fr","language:de","license:apache-2.0","size_categories:10K<n<100K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-12-15T13:48:22.000Z","key":""},{"_id":"675f359dc0c290bb7a3f7e0e","id":"OpenSound/AudioCaps","author":"OpenSound","disabled":false,"gated":false,"lastModified":"2025-05-14T06:01:54.000Z","likes":26,"trendingScore":1,"private":false,"sha":"b29b3243d6ce49c2cd0d48d4b5f0701ae7969ded","downloads":3714,"tags":["license:cc-by-nc-4.0","size_categories:10K<n<100K","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-12-15T20:01:33.000Z","key":""},{"_id":"6760045f18cddae8ef9547a5","id":"deboradum/GeoGuessr-countries","author":"deboradum","disabled":false,"gated":false,"lastModified":"2024-12-16T21:31:32.000Z","likes":2,"trendingScore":1,"private":false,"sha":"2a6258e1e6cde675cdb7366ba125d8456baa78bf","description":"Dataset containing 300 1600x900 images of 85 countries, taken from official GeoGuessr country maps. \nI created this dataset for a GeoGuessr country classifier. 
See my Github repo for training and evaluation code, or the model page to download the pre-trained models.\n","downloads":3148,"tags":["license:apache-2.0","modality:image","region:us"],"createdAt":"2024-12-16T10:43:43.000Z","key":""},{"_id":"676240bb382a5cd3c8b12a65","id":"zai-org/LongBench-v2","author":"zai-org","disabled":false,"gated":false,"lastModified":"2024-12-20T02:22:11.000Z","likes":48,"trendingScore":1,"private":false,"sha":"2b48e494f2c7a2f0af81aae178e05c7e1dde0fe9","description":"\n\t\n\t\t\n\t\tLongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks\n\t\n\n🌐 Project Page: https://longbench2.github.io\n💻 Github Repo: https://github.com/THUDM/LongBench\n📚 Arxiv Paper: https://arxiv.org/abs/2412.15204\nLongBench v2 is designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 has the following features: (1) Length: Context length ranging from 8k to… See the full description on the dataset page: https://huggingface.co/datasets/zai-org/LongBench-v2.","downloads":47471,"tags":["task_categories:multiple-choice","task_categories:question-answering","task_categories:text-classification","task_categories:table-question-answering","language:en","license:apache-2.0","size_categories:n<1K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2412.15204","region:us","Long 
Context","reasoning"],"createdAt":"2024-12-18T03:25:47.000Z","key":""},{"_id":"67625b96b5980239b5bca664","id":"hwjiang/MegaSynth","author":"hwjiang","disabled":false,"gated":false,"lastModified":"2025-05-24T04:58:58.000Z","likes":11,"trendingScore":1,"private":false,"sha":"5e87f08642a7de9073d08a76f4784a52e8aaa6cc","downloads":510,"tags":["license:mit","region:us"],"createdAt":"2024-12-18T05:20:22.000Z","key":""},{"_id":"6765689d3b33c9e1b38cfe25","id":"Brunel-AI/ELOQUENCE","author":"Brunel-AI","disabled":false,"gated":false,"lastModified":"2026-06-25T18:08:20.000Z","likes":2,"trendingScore":1,"private":false,"sha":"dde0d684b2147af3ce0983a30b55c24303930fcc","description":"For the use of the eMetrics code, clone the repository, install the dependencies provided within the requirements.txt and pass the relevant arguments in.\nAdditional explanation for the use of the code is provided throughout.\nThe default configuration is for the use of the NaturalQuestions subset of the fused dataset.\n","downloads":155,"tags":["license:cc-by-4.0","size_categories:1K<n<10K","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us"],"createdAt":"2024-12-20T12:52:45.000Z","key":""},{"_id":"6766e59b81fcd18966169c52","id":"adyen/DABstep","author":"adyen","disabled":false,"gated":false,"lastModified":"2026-06-28T09:02:22.000Z","likes":51,"trendingScore":1,"private":false,"sha":"862b91cfcaeb6262efd553d661f7802e682f52e1","description":"\n\t\n\t\t\n\t\n\t\n\t\tData Agent Benchmark for Multi-step Reasoning (DABstep) Dataset\n\t\n\nThis repository hosts a HF Dataset the supports the benchmark and leaderboard. \nFor the main entrypoint to the benchmark, see the leaderboard here: \nhttps://huggingface.co/spaces/adyen/DABstep \nThis Dataset has 3 splits:\n\ntasks\nsubmissions\ntask_scores\n\nUsers of the benchmark would read from the tasks split to run the baseline. 
The other splits are used to support the leaderboard.\nThe datasets are in the data/context… See the full description on the dataset page: https://huggingface.co/datasets/adyen/DABstep.","downloads":9066,"tags":["license:cc-by-4.0","size_categories:1M<n<10M","format:json","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us"],"createdAt":"2024-12-21T15:58:19.000Z","key":""},{"_id":"67675eec3909129a0a535c25","id":"drive-bench/arena","author":"drive-bench","disabled":false,"gated":false,"lastModified":"2025-02-22T00:41:46.000Z","likes":12,"trendingScore":1,"private":false,"sha":"dc3142b3c648d2e8543dd82dad5178b2cfddadad","downloads":358,"tags":["task_categories:question-answering","language:en","license:cc-by-4.0","size_categories:1K<n<10K","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-12-22T00:35:56.000Z","key":""},{"_id":"676bad1d1f5ca46174be04e7","id":"albertgrigoryan/eastern_armenian_tigran_nune","author":"albertgrigoryan","disabled":false,"gated":false,"lastModified":"2024-12-25T12:46:31.000Z","likes":2,"trendingScore":1,"private":false,"sha":"746b400bf27331411f7a6a5fc232440f1ab0a88d","downloads":65,"tags":["license:gpl-2.0","region:us"],"createdAt":"2024-12-25T06:58:37.000Z","key":""},{"_id":"676cd90c7f3cf27f2e0de3c9","id":"JeffreyXiang/TRELLIS-500K","author":"JeffreyXiang","disabled":false,"gated":false,"lastModified":"2024-12-26T04:21:59.000Z","likes":60,"trendingScore":1,"private":false,"sha":"288b06111bc4d06800293ab087989f852a632af2","description":"\n\t\n\t\t\n\t\tTRELLIS-500K\n\t\n\nTRELLIS-500K is a dataset of 500K 3D assets curated from Objaverse(XL), ABO, 3D-FUTURE, HSSD, and Toys4k, filtered based on aesthetic scores.\nThis dataset serves for 3D generation tasks.\nIt was introduced in the paper Structured 3D Latents for Scalable and Versatile 3D Generation.\n\n\t\n\t\n\t\n\t\tDataset Statistics\n\t\n\nThe following 
table summarizes the dataset's filtering and composition:\nNOTE: Some of the 3D assets lack text captions. Please filter out such assets if captions… See the full description on the dataset page: https://huggingface.co/datasets/JeffreyXiang/TRELLIS-500K.","downloads":959,"tags":["task_categories:image-to-3d","task_categories:text-to-3d","language:en","license:mit","size_categories:100K<n<1M","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2412.01506","region:us"],"createdAt":"2024-12-26T04:18:20.000Z","key":""},{"_id":"676d1189021be0726d61d1a3","id":"ASCIIEval/ASCIITune","author":"ASCIIEval","disabled":false,"gated":false,"lastModified":"2025-11-24T06:08:38.000Z","likes":1,"trendingScore":1,"private":false,"sha":"64528d2f530b65f0157035c479162cbc1d438fda","description":"\n     ASCIIEval: Benchmarking Models' Visual Perception in Text Strings via ASCII Art \n\n\n\n\n📖 Arxiv |\n🤗 ASCIIEval Dataset |\n🤗 ASCIITune Dataset\n\n\n\n\nTABLE OF CONTENTS\n\nIntroduction\nData\nLeaderboards\nLeaderboard for Textual Input\nLeaderboard for Image Input\nLeaderboard for Average Cross-Modality Performance\n\n\nCitation\n\n\n\t\n\t\t\n\t\tIntroduction\n\t\n\nPerceiving visual semantics embedded within consecutive characters is a crucial yet under-explored capability for both Large Language Models (LLMs) and… See the full description on the dataset page: 
https://huggingface.co/datasets/ASCIIEval/ASCIITune.","downloads":85,"tags":["task_categories:visual-question-answering","task_categories:text-generation","task_categories:image-text-to-text","task_categories:question-answering","language:en","license:cc-by-nc-4.0","arxiv:2410.01733","region:us","ascii-art","multimodal","llm","mllm","visual-perception","benchmark"],"createdAt":"2024-12-26T08:19:21.000Z","key":""},{"_id":"6770006a98612a870dec96c4","id":"mrs83/kurtis_mental_health_dpo","author":"mrs83","disabled":false,"gated":"auto","lastModified":"2024-12-29T00:33:50.000Z","likes":4,"trendingScore":1,"private":false,"sha":"e7909ac1e9655d29a80d320477d5e311b89614a4","description":"\n\t\n\t\t\n\t\t⚠️ Content Warning\n\t\n\nThis dataset contains content that some users may find offensive or harmful. Viewer discretion is advised.\n\n\t\n\t\t\n\t\tModel Card: Kurtis DPO dataset.\n\t\n\n\n\t\n\t\t\n\t\tDescription\n\t\n\nThis dataset was created using the microsoft/Phi-3.5-mini-instruct model to generate adversarial responses for alignment training. \nThe model was particularly effective in crafting toxic, biased, or otherwise harmful responses to provided prompts. 
\nThese responses were then filtered and… See the full description on the dataset page: https://huggingface.co/datasets/mrs83/kurtis_mental_health_dpo.","downloads":8,"tags":["task_categories:text-generation","task_categories:text-classification","language:en","license:mit","size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2305.18290","region:us","synthetic","dpo"],"createdAt":"2024-12-28T13:43:06.000Z","key":""},{"_id":"67702e7706122af57ba586c4","id":"BounharAbdelaziz/Dvoice-v2-cleaned","author":"BounharAbdelaziz","disabled":false,"gated":"auto","lastModified":"2025-01-19T01:09:09.000Z","likes":1,"trendingScore":1,"private":false,"sha":"3856fa441dc5c458c8eddb0ad545f80f255bc9a9","description":"\n\t\n\t\t\n\t\tDataset Overview\n\t\n\nThis dataset is a cleaned version of the Dvoice-v2 test set, curated and enhanced for easier use in speech and language processing tasks.\nThe original dataset contains many errors, with audios not matching their transcripts.\nThis dataset has been manually annotated by Abdelaziz Bounhar, with the following additional labels:\n\nArabizi transcription: Transcription in latin scripts.\nGender: Classification of speaker's gender.\nSentiment: Sentiment label of the speaker's… See the full description on the dataset page: https://huggingface.co/datasets/BounharAbdelaziz/Dvoice-v2-cleaned.","downloads":56,"tags":["size_categories:n<1K","format:parquet","modality:audio","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2024-12-28T16:59:35.000Z","key":""},{"_id":"677246f593724e487d680fd1","id":"MahtaFetrat/HomoRich-G2P-Persian","author":"MahtaFetrat","disabled":false,"gated":false,"lastModified":"2025-09-12T19:31:29.000Z","likes":8,"trendingScore":1,"private":false,"sha":"21b189a76c15e9a899e9606c6a909d30e9dc4596","description":"\n\t\n\t\t\n\t\tHomoRich: A Persian Homograph Dataset for G2P 
Conversion\n\t\n\n\n\n\t\n\t\t\n\t\tOverview\n\t\n\nHomoRich is the first large-scale, sentence-level Persian homograph dataset designed for grapheme-to-phoneme (G2P) conversion tasks. It addresses the scarcity of balanced, contextually annotated homograph data for low-resource languages. The dataset was created using a semi-automated pipeline combining human expertise and LLM-generated samples, as described in the paper:\"Fast, Not Fancy: Rethinking G2P… See the full description on the dataset page: https://huggingface.co/datasets/MahtaFetrat/HomoRich-G2P-Persian.","downloads":242,"tags":["task_categories:translation","task_categories:text-to-speech","language:fa","license:cc0-1.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2505.12973","doi:10.57967/hf/6420","region:us","g2p","grapheme-to-phoneme","homograph","persian","homorich","phoneme-translation","farsi","phonemization","homograph-disambiguation","dataset"],"createdAt":"2024-12-30T07:08:37.000Z","key":""},{"_id":"67769d308184cf6ae9e463bb","id":"argilla/synthetic-concise-reasoning-sft-filtered","author":"argilla","disabled":false,"gated":false,"lastModified":"2025-01-03T11:21:56.000Z","likes":8,"trendingScore":1,"private":false,"sha":"1b68fc2fb54220404cb04a3898302e90d53c1ed0","downloads":60,"tags":["size_categories:1K<n<10K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2025-01-02T14:05:36.000Z","key":""},{"_id":"677c75967d036e490734b56d","id":"gmongaras/Imagenet21K","author":"gmongaras","disabled":false,"gated":false,"lastModified":"2025-02-03T04:57:53.000Z","likes":7,"trendingScore":1,"private":false,"sha":"a23c2bbe64852ae38c386ec5fdb64767b233f4e0","description":"NOTE: I have recaptioned all images here\nThis dataset is the entire 21K ImageNet dataset with about 13 million examples and about 19 thousand classes as strings 
\n(for some reason it only had ~19K classes instead of 21K).\nThe images are in PNG format. They can be decoded like in the following example\nimport io\nfrom PIL import Image\nImage.open(io.BytesIO(row[\"image\"]))\n\nwhere row[\"image\"] are the raw image bytes.\n","downloads":1453,"tags":["size_categories:10M<n<100M","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2025-01-07T00:30:14.000Z","key":""},{"_id":"677cf0c16ef55475dd8655ba","id":"WeiChow/PhysBench-train","author":"WeiChow","disabled":false,"gated":false,"lastModified":"2025-01-11T07:55:10.000Z","likes":2,"trendingScore":1,"private":false,"sha":"acf8b16b2f88af55ea9e8be650b794b28964e94a","description":"\n  PhysBench \n\n\n    🌐 Homepage | 🤗 Dataset | 📑 Paper | 💻 Code | 🔺 EvalAI\n\n\nThis repo contains other dataset for PhysBench for further exploration ⚙️.\nNotice: The classification of this dataset is not exactly the same as the downstream test set, and the test set is more diverse and has been more finely labeled and screened. This dataset is for reference only\nUsage:\nsudo apt-get install unrar\nunrar x video.part1.rar\nunrar x image.part1.rar\nunrar x auxiliary_image.part01.rar\nOther links:… See the full description on the dataset page: https://huggingface.co/datasets/WeiChow/PhysBench-train.","downloads":258,"tags":["task_categories:question-answering","language:en","license:apache-2.0","size_categories:10K<n<100K","region:us"],"createdAt":"2025-01-07T09:15:45.000Z","key":""},{"_id":"677d1ce4afd8031632cb203e","id":"bigcode/starcoder2data-extras","author":"bigcode","disabled":false,"gated":false,"lastModified":"2025-03-19T19:33:51.000Z","likes":13,"trendingScore":1,"private":false,"sha":"1ba0d4f31e4c3b6d8586505669841432a19b8c16","description":"\n\t\n\t\t\n\t\tStarCoder2 Extras\n\t\n\nThis is the dataset of extra sources (besides Stack v2 code data) used to train the StarCoder2 family of models. 
It contains the following subsets:\n\nKaggle (kaggle): Kaggle notebooks from Meta-Kaggle-Code dataset, converted to scripts and prefixed with information on the Kaggle datasets used in the notebook. The file headers have a similar format to Jupyter Structured but the code content is only one single script.\nStackOverflow (stackoverflow): stackoverflow… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/starcoder2data-extras.","downloads":3081,"tags":["size_categories:10M<n<100M","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2402.19173","region:us"],"createdAt":"2025-01-07T12:24:04.000Z","key":""},{"_id":"677e56841e04b38abb5bad2a","id":"SRCantona/panoramic_x-ray_tooth","author":"SRCantona","disabled":false,"gated":false,"lastModified":"2025-01-08T13:13:13.000Z","likes":2,"trendingScore":1,"private":false,"sha":"9b07d4eb0210bd7e3dceb9f5e1501015a516b14d","description":"\n\t\n\t\t\n\t\tPanoramic X-ray Tooth Dataset\n\t\n\n\n\t\n\t\t\n\t\tDescription\n\t\n\nThis dataset, curated by Humans in the Loop, features over 500 panoramic dental X-ray images, complete with segmentation masks and annotation files. Designed to support dental research, machine learning model development, and automated dental diagnostics, the dataset showcases a variety of dental conditions and tooth structures. 
The high-quality segmentation masks, produced through expert human annotation, enhance the precision… See the full description on the dataset page: https://huggingface.co/datasets/SRCantona/panoramic_x-ray_tooth.","downloads":54,"tags":["language:en","language:ar","size_categories:1K<n<10K","format:imagefolder","modality:image","library:datasets","library:mlcroissant","region:us","biology","dental","AI","teeth","medical"],"createdAt":"2025-01-08T10:42:12.000Z","key":""},{"_id":"677f5729498775357d040e0d","id":"RZ412/PokerBench","author":"RZ412","disabled":false,"gated":false,"lastModified":"2026-01-08T01:02:59.000Z","likes":38,"trendingScore":1,"private":false,"sha":"7ac61f961c81a50fc0f667820b2fb0e432dfec0d","description":"\n\t\n\t\t\n\t\tPokerBench Overview\n\t\n\nThis dataset contains natural language game scenarios and optimal decisions computed by solvers in No Limit Texas Hold’em. It is divided into pre-flop and post-flop datasets, each with training and test splits. The data is stored in both .json and .csv formats:\n\nJSON files: Contain the natural language prompts (instruction) and optimal decisions (output) derived from the game scenarios.\n\nCSV files: Contain structured game information from which the JSON files… See the full description on the dataset page: https://huggingface.co/datasets/RZ412/PokerBench.","downloads":1258,"tags":["task_categories:other","language:en","license:apache-2.0","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:dask","library:mlcroissant","arxiv:2501.08328","region:us","poker","decision-making"],"createdAt":"2025-01-09T04:57:13.000Z","key":""},{"_id":"67806c6743a58ab7b52ef7ec","id":"Josephgflowers/Finance-Instruct-500k","author":"Josephgflowers","disabled":false,"gated":false,"lastModified":"2026-02-24T05:28:29.000Z","likes":229,"trendingScore":1,"private":false,"sha":"583a98fb0ec14d904e9423b671d9d0fea88891b6","description":"\n\t\n\t\t\n\t\tFinance-Instruct-500k 
Dataset\n\t\n\n\n\t\n\t\t\n\t\tOverview\n\t\n\nFinance-Instruct-500k is a comprehensive and meticulously curated dataset designed to train advanced language models for financial tasks, reasoning, and multi-turn conversations. Combining data from numerous high-quality financial datasets, this corpus provides over 500,000 entries, offering unparalleled depth and versatility for finance-related instruction tuning and fine-tuning.\nThe dataset includes content tailored for financial… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/Finance-Instruct-500k.","downloads":1006,"tags":["license:apache-2.0","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","finance","fine-tuning","conversational-ai","named-entity-recognition","sentiment-analysis","topic-classification","rag","multilingual","lightweight-llm"],"createdAt":"2025-01-10T00:40:07.000Z","key":""},{"_id":"67861adc1f861473a2b89132","id":"MMDocIR/MMDocIR_Evaluation_Dataset","author":"MMDocIR","disabled":false,"gated":false,"lastModified":"2025-05-28T03:08:56.000Z","likes":11,"trendingScore":1,"private":false,"sha":"bdcb36ecb3eee73667180ee3fb24fe433f6dd2a4","description":"\n\t\n\t\t\n\t\tEvaluation Datasets\n\t\n\n\n\t\n\t\t\n\t\tEvaluation Set Overview\n\t\n\nMMDocIR evaluation set includes 313 long documents averaging 65.1 pages, categorized into ten main domains: research reports, administration&industry, tutorials&workshops, academic papers, brochures, financial reports, guidebooks, government documents, laws, and news articles. Different domains feature distinct distributions of multi-modal information. 
Overall, the modality distribution is: Text (60.4%), Image (18.8%), Table… See the full description on the dataset page: https://huggingface.co/datasets/MMDocIR/MMDocIR_Evaluation_Dataset.","downloads":575,"tags":["license:apache-2.0","arxiv:2501.08828","region:us"],"createdAt":"2025-01-14T08:05:48.000Z","key":""},{"_id":"67887ab869ed6148533ed38b","id":"BIOMEDICA/biomedica_webdataset_24M","author":"BIOMEDICA","disabled":false,"gated":"auto","lastModified":"2025-01-22T01:06:11.000Z","likes":36,"trendingScore":1,"private":false,"sha":"d8282587b5595fc8a29dee7f4ffbe5b1731f6011","description":"\n\t\n\t\t\n\t\tDataset Card for Dataset Name\n\t\n\n\n    \n\n\n\n\n\n  Arxiv: Arxiv \n      |    \n  Website: Biomedica\n      |    \n Training instructions: OpenCLIP\n      |    \n  Tutorial: Google Colab\n \n\n\n\n\n\n\n\nBIOMEDICA Dataset is a large-scale, deep-learning-ready biomedical dataset containing over 24M imagecaption pairs and 30M image-references from 6M unique open-source articles. Each data point is highly annotated with over 27 unique metadata fields, including article level information (e.g., license… See the full description on the dataset page: https://huggingface.co/datasets/BIOMEDICA/biomedica_webdataset_24M.","downloads":1619,"tags":["size_categories:n>1T","arxiv:2501.07171","region:us","medical","biology","chemistry"],"createdAt":"2025-01-16T03:19:20.000Z","key":""},{"_id":"678a59c3e9c4df77fd1275ef","id":"CarlanLark/pasa-dataset","author":"CarlanLark","disabled":false,"gated":"auto","lastModified":"2025-01-20T14:37:57.000Z","likes":35,"trendingScore":1,"private":false,"sha":"232428b0c867268c3b8ded90db4d98c1b30501d6","description":"\n\t\n\t\t\n\t\tPaSa Dataset\n\t\n\nData for the paper: PaSa: An LLM Agent for Comprehensive Academic Paper Search\nFurther information: https://github.com/bytedance/pasa\n\nPaSa: An LLM Agent for Comprehensive Academic Paper Search\nYichen He, Guanhua Huang, Peiyuan Feng, Yuan Lin, Yuchen Zhang, Hang Li, Weinan E\nPaper link: 
https://arxiv.org/abs/2501.10120\n\n","downloads":393,"tags":["arxiv:2501.10120","region:us"],"createdAt":"2025-01-17T13:23:15.000Z","key":""},{"_id":"678cb67b281a0e32fe6f46d5","id":"hotchpotch/sentence_transformer_japanese","author":"hotchpotch","disabled":false,"gated":false,"lastModified":"2025-01-20T10:19:14.000Z","likes":7,"trendingScore":1,"private":false,"sha":"e919174ab810b903e5f08f8f07fab4b83ed166de","description":"\n日本語のデータセットを SentenceTransformes で学習しやすいカラム名と構造に変換したもの。\n主に (anchor, positive), (anchor, positive, negative), (anchor, positive, negative_1, ..., negative_n) といった構造になっているため、とりわけ対照学習で使いやすくなっています。\n\n\n以下のデータセットから作成\nhttps://huggingface.co/datasets/hpprc/emb\nhttps://huggingface.co/datasets/hotchpotch/hpprc_emb-scores のリランカースコアを用いて、positive(>=0.7) / negative(<=0.3) のフィルタリングを行った\n\n\nhttps://huggingface.co/datasets/hpprc/llmjp-kaken\nhttps://huggingface.co/datasets/hpprc/msmarco-ja… See the full description on the dataset page: https://huggingface.co/datasets/hotchpotch/sentence_transformer_japanese.","downloads":111,"tags":["language:ja","license:unknown","size_categories:10M<n<100M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2025-01-19T08:23:23.000Z","key":""},{"_id":"678cc761259e319f8166f325","id":"o0dimplz0o/Zeroth-STT-Korean","author":"o0dimplz0o","disabled":false,"gated":false,"lastModified":"2025-01-20T09:54:55.000Z","likes":5,"trendingScore":1,"private":false,"sha":"c8ed68c1c86dc1528c8e5fa16be12b983cfbbbe2","description":"\n\t\n\t\t\n\t\tZeroth-STT-Korean Dataset\n\t\n\n\n\t\n\t\t\n\t\tDescription\n\t\n\nThis is a shuffled version of the Zeroth-STT-Ko dataset.\n\n\t\n\t\t\n\t\tCitation\n\t\n\nZeroth-Korean Dataset, created by [Lucas Jo(@Atlas Guide Inc.) 
and Wonkyum Lee(@Gridspace Inc.)], 2023.\nAvailable at https://github.com/goodatlas/zeroth under CC-BY-4.0 license.\nJunhoee/STT_Korean_Dataset_80000 Dataset, created by [Junhoee], 2024.\nAvailable at https://huggingface.co/datasets/Junhoee/STT_Korean_Dataset_80000\n","downloads":220,"tags":["task_categories:automatic-speech-recognition","language:ko","license:apache-2.0","size_categories:100K<n<1M","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2025-01-19T09:35:29.000Z","key":""},{"_id":"678e5dc04a599f529d72e072","id":"dvgodoy/CUAD_v1_Contract_Understanding_PDF","author":"dvgodoy","disabled":false,"gated":false,"lastModified":"2025-01-29T18:32:20.000Z","likes":1,"trendingScore":1,"private":false,"sha":"b66f9c556471d32127e21f3c171a11c34336b87f","description":"\n\t\n\t\t\n\t\tDataset Card for Contract Understanding Atticus Dataset (CUAD) PDF\n\t\n\nThis dataset contains the PDFs and the full text of 509 commercial legal contracts from the original CUAD dataset. 
One of the original 510 contracts was removed due to being a scanned copy.\nThe extracted text was cleaned using clean-text.\nThe PDFs were encoded in base64 and added as the pdf_bytes_base64 feature.\nYou can easily and quickly load it:\ndataset = load_dataset(\"dvgodoy/CUAD_v1_Contract_Understanding_PDF\")… See the full description on the dataset page: https://huggingface.co/datasets/dvgodoy/CUAD_v1_Contract_Understanding_PDF.","downloads":146,"tags":["language:en","license:cc-by-4.0","size_categories:n<1K","format:parquet","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2103.06268","region:us","PDF"],"createdAt":"2025-01-20T14:29:20.000Z","key":""},{"_id":"678e60cecdde0f9efe15b589","id":"librarian-bots/dataset-columns","author":"librarian-bots","disabled":false,"gated":false,"lastModified":"2026-06-30T01:57:53.000Z","likes":4,"trendingScore":1,"private":false,"sha":"29dca783aada337afdf955cbe7286da3212b2463","downloads":591,"tags":["size_categories:100M<n<1B","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","region:us"],"createdAt":"2025-01-20T14:42:22.000Z","key":""},{"_id":"67915b42013258a9daf6636f","id":"bespokelabs/Bespoke-Stratos-35k","author":"bespokelabs","disabled":false,"gated":false,"lastModified":"2025-01-22T20:55:45.000Z","likes":6,"trendingScore":1,"private":false,"sha":"e25709a08e900b2b346eb67387d4c55ef64f16fb","downloads":78,"tags":["size_categories:10K<n<100K","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2025-01-22T20:55:30.000Z","key":""},{"_id":"679212f0e75ed5c939a2be9a","id":"atlasia/DODa-audio-dataset","author":"atlasia","disabled":false,"gated":"auto","lastModified":"2025-02-07T14:36:46.000Z","likes":21,"trendingScore":1,"private":false,"sha":"588aec51a3f007fa23082766ce801c483b5cc0bb","description":"\n\t\n\t\t\n\t\tMoroccan Darija 
Speech Dataset\n\t\n\n\n\t\n\t\t\n\t\tOverview\n\t\n\nThis dataset consists of 12,743 parallel text and speech samples for Moroccan Darija, including its transcription in both Latin and Arabic scripts and English translations. It was created to support speech recognition, language modeling, and NLP tasks for Moroccan Darija.\n\n\t\n\t\t\n\t\tDataset Source\n\t\n\nThe dataset was originally sourced from this repository, where it was available as a CSV file containing three columns:\n\ndarija:… See the full description on the dataset page: https://huggingface.co/datasets/atlasia/DODa-audio-dataset.","downloads":324,"tags":["language:en","license:mit","size_categories:10K<n<100K","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2025-01-23T09:59:12.000Z","key":""},{"_id":"6792405d5bb0ab8acca8bfc0","id":"atlasia/Moroccan-Darija-Wiki-Audio-Dataset","author":"atlasia","disabled":false,"gated":"auto","lastModified":"2025-02-14T12:35:13.000Z","likes":15,"trendingScore":1,"private":false,"sha":"c892a371521207f96e236711c41e43f80c17ac12","description":"\n  \n\n\n\n\n\t\n\t\t\n\t\tMoroccan Darija Wiki Audio Dataset\n\t\n\n\n\t\n\t\t\n\t\tOverview\n\t\n\nThe Moroccan Darija Wiki Audio Dataset consists of 551 parallel text and speech samples of Moroccan Darija sourced from Wikipedia Darija . This dataset is designed to support speech recognition, language modeling, and various NLP tasks for Moroccan Darija.  \n\n\t\n\t\t\n\t\tDataset Source\n\t\n\nThe data was scraped from Wikipedia (ary) using the WikiScraper tool.  
\n\n\t\n\t\t\n\t\tData Preprocessing\n\t\n\nTo ensure data quality, we applied… See the full description on the dataset page: https://huggingface.co/datasets/atlasia/Moroccan-Darija-Wiki-Audio-Dataset.","downloads":171,"tags":["license:cc-by-4.0","size_categories:n<1K","format:parquet","modality:audio","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","library:webdataset","region:us","audio","webdataset"],"createdAt":"2025-01-23T13:13:01.000Z","key":""},{"_id":"6795e2882ec68b4193d4dbf2","id":"EricLu/SCP-116K","author":"EricLu","disabled":false,"gated":false,"lastModified":"2026-05-08T06:04:15.000Z","likes":125,"trendingScore":1,"private":false,"sha":"cfff563b98222d48802fcab0768ff523ee60f713","description":"\n\t\n\t\t\n\t\tNew Version Available: SCP-378K\n\t\n\nA new version of this dataset has been released:\n👉 EricLu/SCP-378K\nSCP-378K is a larger and improved scientific problem-solution dataset containing 377,705 examples. Unlike the current SCP-116K repository, where extracted source solutions are available for only a subset of problems, every problem in SCP-378K is paired with an extracted standard solution from the source material.\nThe key differences between SCP-116K and SCP-378K are summarized below:… See the full description on the dataset page: 
https://huggingface.co/datasets/EricLu/SCP-116K.","downloads":443,"tags":["task_categories:text-generation","task_categories:question-answering","language:en","license:cc-by-nc-sa-4.0","size_categories:100K<n<1M","format:json","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2501.15587","region:us","chemistry","biology","medical","mathematics"],"createdAt":"2025-01-26T07:21:44.000Z","key":""},{"_id":"6796c72b7534713f94cc75b3","id":"songlab/TraitGym","author":"songlab","disabled":false,"gated":false,"lastModified":"2025-03-25T19:09:05.000Z","likes":10,"trendingScore":1,"private":false,"sha":"1fde19555fe8c0a55b1382bdf4c6f7082209f566","description":"\n\t\n\t\t\n\t\t🧬 TraitGym\n\t\n\nBenchmarking DNA Sequence Models for Causal Regulatory Variant Prediction in Human Genetics\n🏆 Leaderboard: https://huggingface.co/spaces/songlab/TraitGym-leaderboard\n\n\t\n\t\t\n\t\t⚡️ Quick start\n\t\n\n\nLoad a datasetfrom datasets import load_dataset\n\ndataset = load_dataset(\"songlab/TraitGym\", \"mendelian_traits\", split=\"test\")\n\n\nExample notebook to run variant effect prediction with a gLM, runs in 5 min on Google Colab: TraitGym.ipynb \n\n\n\t\n\t\n\t\n\t\t🤗 Resources… See the full description on the dataset page: 
https://huggingface.co/datasets/songlab/TraitGym.","downloads":4002,"tags":["license:mit","size_categories:10M<n<100M","format:parquet","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","dna","variant-effect-prediction","biology","genomics"],"createdAt":"2025-01-26T23:37:15.000Z","key":""},{"_id":"67971dfab546b713001621e0","id":"lmms-lab/multimodal-open-r1-8k-verified","author":"lmms-lab","disabled":false,"gated":false,"lastModified":"2025-01-27T05:48:31.000Z","likes":76,"trendingScore":1,"private":false,"sha":"e3c8f3adf1dfc2b674cf736181ae6b9f9839a3ce","downloads":1153,"tags":["size_categories:1K<n<10K","format:parquet","modality:image","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2025-01-27T05:47:38.000Z","key":""},{"_id":"6797e648de960c48ff034e54","id":"open-thoughts/OpenThoughts-114k","author":"open-thoughts","disabled":false,"gated":false,"lastModified":"2025-08-31T00:24:46.000Z","likes":867,"trendingScore":1,"private":false,"sha":"bd093c3994fd54d2390985b66988ddf282a55eb6","description":"\n    \n\n\n\n[!NOTE]\nWe have released a paper for OpenThoughts! 
See our paper here.\n\n\n \n\n\n\n\t\n\t\t\n\t\n\t\n\t\tOpen-Thoughts-114k\n\t\n\nOpen synthetic reasoning dataset with 114k high-quality examples covering math, science, code, and puzzles!\nInspect the content with rich formatting with Curator Viewer.\n\n\t\n\t\n\t\n\t\tAvailable Subsets\n\t\n\ndefault subset containing ready-to-train data used to finetune the OpenThinker-7B and OpenThinker-32B models:\nds = load_dataset(\"open-thoughts/OpenThoughts-114k\", split=\"train\")… See the full description on the dataset page: https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k.","downloads":71732,"tags":["license:apache-2.0","size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:dask","library:polars","library:mlcroissant","arxiv:2506.04178","region:us","curator","synthetic"],"createdAt":"2025-01-27T20:02:16.000Z","key":""},{"_id":"67988f98d8e2dcea3d2427d4","id":"rogue-security/prompt-injections-benchmark","author":"rogue-security","disabled":false,"gated":"auto","lastModified":"2026-04-02T08:52:25.000Z","likes":39,"trendingScore":1,"private":false,"sha":"9ef1aa46a7e5eedb096be0481be8011ede1e72e8","description":"\n\t\n\t\t\n\t\tDataset: Qualifire Benchmark Prompt Injection(Jailbreak vs. Benign) Datasets\n\t\n\n\n\t\n\t\t\n\t\tOverview\n\t\n\nThis dataset contains 5,000 prompts, each labeled as either jailbreak or benign. 
The dataset is designed for evaluating AI models' robustness against adversarial prompts and their ability to distinguish between safe and unsafe inputs.\n\n\t\n\t\t\n\t\tDataset Structure\n\t\n\n\nTotal Samples: 5,000\nLabels: jailbreak, benign\nColumns: \ntext: The input text\nlabel: The classification (jailbreak or benign)… See the full description on the dataset page: https://huggingface.co/datasets/rogue-security/prompt-injections-benchmark.","downloads":1262,"tags":["license:cc-by-nc-4.0","size_categories:1K<n<10K","format:parquet","format:optimized-parquet","modality:text","library:datasets","library:pandas","library:polars","library:mlcroissant","region:us","prompt","injection","jailbreak","benign"],"createdAt":"2025-01-28T08:04:40.000Z","key":""},{"_id":"6798f9f0f138ec2fde02033f","id":"aadityaubhat/synthetic-emotions","author":"aadityaubhat","disabled":false,"gated":false,"lastModified":"2025-03-19T18:46:58.000Z","likes":6,"trendingScore":1,"private":false,"sha":"7b30c7585105756084e5a94a69a59313aeb2ece5","description":"\n\t\n\t\t\n\t\tSynthetic Emotions Dataset\n\t\n\n\n\t\n\t\t\n\t\tOverview\n\t\n\nSynthetic Emotions is a video dataset of AI-generated human emotions created using OpenAI Sora. 
It features short (5-sec, 480p, 9:16) videos depicting diverse individuals expressing emotions like happiness, sadness, anger, fear, surprise, and more.\nThis dataset is ideal for emotion recognition, facial expression analysis, affective computing, and AI-human interaction research.\n\n\t\n\t\t\n\t\tDataset Details\n\t\n\nTotal Videos: 100\nVideo Format:… See the full description on the dataset page: https://huggingface.co/datasets/aadityaubhat/synthetic-emotions.","downloads":168,"tags":["task_categories:video-classification","task_categories:text-to-video","license:mit","size_categories:n<1K","modality:text","modality:video","library:datasets","library:mlcroissant","doi:10.57967/hf/4297","region:us","video"],"createdAt":"2025-01-28T15:38:24.000Z","key":""},{"_id":"6799b71a0866dc379fcb6294","id":"AfterQuery/FinanceQA","author":"AfterQuery","disabled":false,"gated":false,"lastModified":"2025-02-21T01:12:15.000Z","likes":20,"trendingScore":1,"private":false,"sha":"6eb03b46c2e7f71ad52f7db4d12a9eabd523f575","description":"FinanceQA is a comprehensive testing suite designed to evaluate LLMs' performance on complex financial analysis tasks that mirror real-world investment work. 
The dataset aims to be substantially more challenging and practical than existing financial benchmarks, focusing on tasks that require precise calculations and professional judgment.\nPaper: https://arxiv.org/abs/2501.18062\nDescription  \nThe dataset contains two main categories of questions:\n\nTactical Questions: Questions based on… See the full description on the dataset page: https://huggingface.co/datasets/AfterQuery/FinanceQA.","downloads":403,"tags":["task_categories:question-answering","language:en","license:apache-2.0","size_categories:n<1K","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","arxiv:2501.18062","region:us","finance"],"createdAt":"2025-01-29T05:05:30.000Z","key":""},{"_id":"679ace31f8db930fcde8b775","id":"alea-institute/kl3m-data-dotgov-www.cftc.gov","author":"alea-institute","disabled":false,"gated":false,"lastModified":"2025-04-11T01:49:53.000Z","likes":1,"trendingScore":1,"private":false,"sha":"a8b6f41aea276b11d293ae5079d9d2f236c9c320","description":"\n\t\n\t\t\n\t\tKL3M Data Project\n\t\n\n\nNote: This page provides general information about the KL3M Data Project. Additional details specific to this dataset will be added in future updates. 
For complete information, please visit the GitHub repository or refer to the KL3M Data Project paper.\n\n\n\t\n\t\t\n\t\n\t\n\t\tDescription\n\t\n\nThis dataset is part of the ALEA Institute's KL3M Data Project, which provides copyright-clean training resources for large language models.\n\n\t\n\t\n\t\n\t\tDataset Details\n\t\n\n\nFormat: Parquet… See the full description on the dataset page: https://huggingface.co/datasets/alea-institute/kl3m-data-dotgov-www.cftc.gov.","downloads":19,"tags":["size_categories:100K<n<1M","format:parquet","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","arxiv:2504.07854","arxiv:2503.17247","region:us"],"createdAt":"2025-01-30T00:56:17.000Z","key":""},{"_id":"679ae77de7f671635d858841","id":"QuixiAI/dolphin-r1","author":"QuixiAI","disabled":false,"gated":false,"lastModified":"2025-01-30T18:51:36.000Z","likes":302,"trendingScore":1,"private":false,"sha":"f6ac651b3911352ce9bc6d3340c98a66007feb88","description":"\n\t\n\t\t\n\t\tDolphin R1 🐬\n\t\n\nAn Apache-2.0 dataset curated by Eric Hartford and Cognitive Computations\n\nDiscord: https://discord.gg/cognitivecomputations\n\n\n\n\t\n\t\t\n\t\n\t\n\t\tSponsors\n\t\n\nOur appreciation for the generous sponsors of Dolphin R1 - Without whom this dataset could not exist.\n\nDria https://x.com/driaforall - Inference Sponsor (DeepSeek)\nChutes https://x.com/rayon_labs - Inference Sponsor (Flash)\nCrusoe Cloud - Compute Sponsor\nAndreessen Horowitz - provided the grant that originally launched… See the full description on the dataset page: 
https://huggingface.co/datasets/QuixiAI/dolphin-r1.","downloads":844,"tags":["license:apache-2.0","size_categories:100K<n<1M","format:json","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2025-01-30T02:44:13.000Z","key":""},{"_id":"679b73794ba73ebdd8b45f5f","id":"franciellevargas/HateBR","author":"franciellevargas","disabled":false,"gated":false,"lastModified":"2025-03-09T16:04:04.000Z","likes":6,"trendingScore":1,"private":false,"sha":"077db456f6c4b376540ef07fbcfdec61e7806cc7","description":"\n\t\n\t\t\n\t\tHateBR: The Evaluation Benchmark for Brazilian Portuguese Hate Speech Detection\n\t\n\nHateBR is the first large-scale, expert-annotated dataset of Brazilian Instagram comments specifically designed for hate speech detection on the web and social media. The dataset was collected from Brazilian Instagram comments made by politicians and manually annotated by specialists.\nIt contains 7,000 documents, annotated across three distinct layers:\nBinary classification (offensive vs. 
non-offensive… See the full description on the dataset page: https://huggingface.co/datasets/franciellevargas/HateBR.","downloads":336,"tags":["task_categories:text-classification","language:pt","license:apache-2.0","size_categories:1K<n<10K","format:csv","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","hate","hatespeech","brazilianportuguese","evaluationbenchmark","naturallanguageprocessing","machinelearning"],"createdAt":"2025-01-30T12:41:29.000Z","key":""},{"_id":"679c449bb64482d4adcd79d3","id":"preetam7/dynamic_kgqa","author":"preetam7","disabled":false,"gated":"auto","lastModified":"2025-02-15T21:49:51.000Z","likes":3,"trendingScore":1,"private":false,"sha":"67f9e639f1e2d352841ec483b6ed067603db820d","downloads":25,"tags":["size_categories:100K<n<1M","format:parquet","modality:tabular","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2025-01-31T03:33:47.000Z","key":""},{"_id":"679f005a8e344720aebf9ae2","id":"leonardPKU/clevr_cogen_a_train","author":"leonardPKU","disabled":false,"gated":false,"lastModified":"2025-02-02T05:29:39.000Z","likes":41,"trendingScore":1,"private":false,"sha":"d5a8da66aa97c9c19ebe4075b7c7f777fef850a6","downloads":476,"tags":["size_categories:10K<n<100K","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us"],"createdAt":"2025-02-02T05:19:22.000Z","key":""},{"_id":"679f8824bab3e2095199b4b7","id":"axonstan/LogoDet-3K","author":"axonstan","disabled":false,"gated":false,"lastModified":"2025-02-03T13:44:31.000Z","likes":1,"trendingScore":1,"private":false,"sha":"a8c93205dce11454c05334ce0c2969418c5247fa","description":"\n\t\n\t\t\n\t\tDataset Card for LogoDet-3K\n\t\n\nLogoDet-3K dataset aims on logotype (image) detection task.\n\n\n\n\t\n\t\t\n\t\tDataset Description\n\t\n\nLogoDet-3K consists of thousand images with brands' 
logotypes and their bounding boxes. This dataset aims to help train logotype detection models.\n\n\n\n\n\nLicense: MIT\n\n\n\n\n\t\n\t\t\n\t\tDataset Usage\n\t\n\nYou can download this dataset by the following command (make sure that you have installed Huggingface Datasets):\nfrom datasets import load_dataset\n\ndataset =… See the full description on the dataset page: https://huggingface.co/datasets/axonstan/LogoDet-3K.","downloads":323,"tags":["task_categories:image-feature-extraction","task_categories:image-segmentation","language:en","license:mit","size_categories:100K<n<1M","format:parquet","modality:image","modality:text","library:datasets","library:dask","library:mlcroissant","library:polars","region:us","legal"],"createdAt":"2025-02-02T14:58:44.000Z","key":""},{"_id":"67a0aa3184006c8c685b5f17","id":"tsnngw/CaReSound","author":"tsnngw","disabled":false,"gated":false,"lastModified":"2025-06-26T18:01:26.000Z","likes":3,"trendingScore":1,"private":false,"sha":"6fc529c9cb4d815d659988bdab231ad3c99caddc","description":"\n\t\n\t\t\n\t\tCaReSound\n\t\n\nCaReSound is a benchmark dataset designed for open-ended diagnostic reasoning using cardiac and respiratory auscultation audio. 
It includes annotated medical sound recordings enriched with metadata and automatically generated question-answer (QA) pairs, enabling research into audio-language modeling for healthcare.\n\n\t\n\t\t\n\t\tDataset Highlights\n\t\n\n\nMultimodal: Each sample pairs an auscultation audio clip with metadata and diagnostic question-answer pairs.\nDiverse: Built from… See the full description on the dataset page: https://huggingface.co/datasets/tsnngw/CaReSound.","downloads":40,"tags":["task_categories:question-answering","language:en","license:apache-2.0","size_categories:10K<n<100K","format:csv","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us"],"createdAt":"2025-02-03T11:36:17.000Z","key":""},{"_id":"67a326d301694197148a21d5","id":"hqfang/rlbench-18-tasks","author":"hqfang","disabled":false,"gated":false,"lastModified":"2025-02-07T12:29:21.000Z","likes":7,"trendingScore":1,"private":false,"sha":"ddb141a10147929b3353881f73057b11931d3466","description":"\n\t\n\t\t\n\t\tRLBench 18 Tasks Dataset\n\t\n\n\n\t\n\t\t\n\t\tOverview\n\t\n\nThis repository provides the RLBench dataset for 18 tasks, originally hosted by PerAct in Google Drive. Since downloading large files from Google Drive via terminal can be problematic due to various limits, we have mirrored the dataset on Hugging Face for easier access. 
To know more about the details of this dataset, please refer to PerAct.\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Structure\n\t\n\nThe dataset is organized into three splits:\ndata/\n├── train/  #… See the full description on the dataset page: https://huggingface.co/datasets/hqfang/rlbench-18-tasks.","downloads":1518,"tags":["license:apache-2.0","region:us"],"createdAt":"2025-02-05T08:52:35.000Z","key":""},{"_id":"67a3889c0e3bb7d3b0f99881","id":"KadamParth/NCERT_Economics_12th","author":"KadamParth","disabled":false,"gated":false,"lastModified":"2025-02-25T19:35:24.000Z","likes":2,"trendingScore":1,"private":false,"sha":"c3adec00fe3b00752be90d22676ad43e9fc29b73","downloads":49,"tags":["task_categories:question-answering","task_categories:summarization","task_categories:text-generation","language:en","license:mit","size_categories:1K<n<10K","format:csv","modality:tabular","modality:text","library:datasets","library:pandas","library:mlcroissant","library:polars","region:us","ncert","economics","educational","intelligent_tutoring_system","its"],"createdAt":"2025-02-05T15:49:48.000Z","key":""}]