curationCorpus
Tags: Text, NLP
License: CC BY 4.0

Overview

The Curation Corpus is a collection of 40,000 professionally-written summaries of news articles,
with links to the articles themselves. This repository provides a scraper to access them. If
you're interested in commercial use or access to the wider catalogue of Curation data, including
a larger set of over 150,000 professionally-written abstracts and a scalable, on-demand content
abstraction API (driven by humans or AI), please get in touch. For our thoughts on how we hope
this release will help the NLP community, see our post introducing the dataset.

Instructions

  • Clone this repository (or just copy the code from web_scraper.py).
  • Download the URLs, headlines, and summaries from here.
  • Run web_scraper.py with three command-line arguments: the path to the CSV file without article
    text, the path to a new CSV file that will contain the article text, and a batch size that sets
    how many URLs are scraped at a time. Larger batch sizes run faster but may drop more articles
    due to timeouts. I recommend ~50 on a 2015 MacBook Pro.
git clone https://github.com/CurationCorp/curation-corpus.git
cd curation-corpus
wget https://curation-datasets.s3-eu-west-1.amazonaws.com/curation-corpus-base.csv
python web_scraper.py curation-corpus-base.csv curation-corpus-base-with-articles.csv 50
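
To make the batch-size trade-off concrete, here is a minimal sketch of batched scraping. It is not the code in web_scraper.py: the names batched, scrape_in_batches, and the fetch callable are hypothetical, and the thread pool is just one way to run a batch concurrently. Any URL whose fetch raises (for example on a timeout) is recorded as None, which corresponds to a dropped article.

```python
from concurrent.futures import ThreadPoolExecutor


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def scrape_in_batches(urls, batch_size, fetch):
    """Fetch every URL, batch_size at a time.

    fetch is any callable that takes a URL and returns article text
    (hypothetical; supply your own HTTP and parsing logic). URLs whose
    fetch raises an exception are mapped to None.
    """
    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            # Timeouts, HTTP errors, and parse failures all count as dropped.
            return None

    results = {}
    for batch in batched(urls, batch_size):
        # One worker per URL in the batch, so the whole batch runs concurrently.
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            for url, text in zip(batch, pool.map(safe_fetch, batch)):
                results[url] = text
    return results
```

With this structure, a larger batch size means more simultaneous requests per round, which is faster overall but puts more load on the network connection and makes timeouts more likely.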

Some URLs will return messy results due to content changing over time, paywalls, etc. We've tried
to remove the worst offenders from this release. There is probably still scope for improving the
scraper, though.
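
If you want to screen out messy results yourself after scraping, a rough heuristic filter is one option. The sketch below is an assumption, not part of this repository: the function name looks_messy, the min_words threshold, and the paywall phrases are all hypothetical and should be tuned against your own scraped output.

```python
# Hypothetical phrases that often appear on paywall or login pages.
PAYWALL_HINTS = ("subscribe to continue", "sign in to read")


def looks_messy(article_text, summary, min_words=50):
    """Return True if the scraped article text is probably unusable.

    Flags text that is empty or very short, text shorter than its own
    summary, and text containing a paywall phrase. All thresholds and
    phrases are illustrative assumptions.
    """
    text = (article_text or "").strip()
    words = text.split()
    if len(words) < min_words:
        return True
    if len(words) < len(summary.split()):
        # An article shorter than its summary was likely truncated.
        return True
    lowered = text.lower()
    return any(hint in lowered for hint in PAYWALL_HINTS)
```

Rows flagged by a check like this can be dropped or inspected by hand before the data is used for training or evaluation.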

Citation

@misc{curationcorpusbase:2020,
  title={Curation Corpus Base},
  author={Curation},
  year={2020}
}

License

CC BY 4.0

Data Summary

Type: Text
Amount: 40K
Size: 123.13MB
Provided by: Henry Dashwood
ML at Curation. Previously I interned at Talkspace and Sharemob. I keep personal projects here.