sciencescraper.sciencedirect

This subpackage contains modules for scraping articles from ScienceDirect.

Submodules

  • scidir_extract: Functions to extract information from the raw XML text of a ScienceDirect article.

  • scidir_clean: Functions to clean the text extracted from ScienceDirect articles.

  • scidir_scrape: Functions for retrieving the clean text of ScienceDirect articles.

The main function of this subpackage is get_article_info in scidir_scrape. It retrieves a ScienceDirect article through the ScienceDirect API and returns a dictionary containing the article's title, authors, journal, year, URL, open access status, keywords, abstract, methods, results, discussion, and references.

To find articles, use search_scidir in scidir_search, which searches ScienceDirect for a query string, optionally restricted to results from a given start date onward.
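
The two entry points described above can be combined into a search-then-retrieve workflow. The sketch below assumes that search_scidir returns a list of DOI strings and that both functions are importable from sciencescraper.sciencedirect, as the signatures on this page suggest. The helper collect_articles and its search_fn and fetch_fn parameters are hypothetical names introduced here so the logic can be exercised without network access.

```python
def collect_articles(api_key, query, start_date=None, search_fn=None, fetch_fn=None):
    """Search ScienceDirect, then fetch the article dictionary for each DOI found.

    search_fn and fetch_fn are hypothetical injection points (not part of the
    package); when omitted, the real functions are imported from
    sciencescraper.sciencedirect.
    """
    if search_fn is None or fetch_fn is None:
        # Imported lazily so the helper can be tested with stand-in functions.
        from sciencescraper.sciencedirect import get_article_info, search_scidir
        search_fn = search_fn or search_scidir
        fetch_fn = fetch_fn or get_article_info

    # Assumption: the search returns an iterable of DOI strings.
    dois = search_fn(api_key, query, startDate=start_date)
    return [fetch_fn(api_key, doi=doi) for doi in dois]
```

With a valid key, a call such as collect_articles("YOUR_API_KEY", "perovskite solar cells", start_date="2024-01-01") should yield one dictionary per matching article, subject to the access rights attached to the key.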

Package Contents

Functions

get_article_info(api_key[, doi, pii, url, chunk_size])

Get the full text and metadata of a ScienceDirect article as a dictionary, using the ScienceDirect API.

get_full_text(api_key[, doi, pii, url, chunk_size])

Get the full text of a ScienceDirect article using the ScienceDirect API.

check_new_articles(api_key, query, days)

Check for new articles in Elsevier's ScienceDirect database and notify the user of any new articles.

search_scidir(api_key, query[, sortBy, startDate, ...])

Get articles from Elsevier's ScienceDirect database that are relevant to a specified search query.

sciencescraper.sciencedirect.get_article_info(api_key, doi=None, pii=None, url=None, chunk_size=None)

Get the full text and metadata of a ScienceDirect article as a dictionary, using the ScienceDirect API.

Parameters:
  • api_key (str) – The API key for the ScienceDirect API. API keys can be obtained by creating an account at https://dev.elsevier.com/.

  • doi (str, optional) – The DOI of the article to be scraped.

  • pii (str, optional) – The PII of the article to be scraped.

  • url (str, optional) – The URL of the article to be scraped.

  • chunk_size (int, optional) – The size of the chunks to split the full text into. Default is None.

Returns:

A dictionary containing the title, authors, journal, year, URL, open access status, keywords, abstract, methods, results, discussion, and references of the article.

Return type:

dict
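
Because doi, pii, and url are all optional, a caller has to decide which identifier to pass. The page does not say what happens when none or several are given, so the sketch below takes the conservative route of requiring exactly one. identifier_kwargs is a hypothetical helper, not part of sciencescraper.

```python
def identifier_kwargs(doi=None, pii=None, url=None):
    """Return a keyword dict holding exactly one article identifier.

    Assumption: the retrieval functions expect a single identifier. The
    documentation does not state how conflicting identifiers are resolved.
    """
    candidates = {"doi": doi, "pii": pii, "url": url}
    given = {name: value for name, value in candidates.items() if value is not None}
    if len(given) != 1:
        raise ValueError(
            f"expected exactly one of doi, pii, url; got {sorted(given) or 'none'}"
        )
    return given
```

The result can be unpacked into the call, for example get_article_info(api_key, **identifier_kwargs(doi="10.1016/...")), where the DOI shown is a placeholder.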

sciencescraper.sciencedirect.get_full_text(api_key, doi=None, pii=None, url=None, chunk_size=None)

Get the full text of a ScienceDirect article using the ScienceDirect API.

Parameters:
  • api_key (str) – The API key for the ScienceDirect API. API keys can be obtained by creating an account at https://dev.elsevier.com/.

  • doi (str, optional) – The DOI of the article to be scraped.

  • pii (str, optional) – The PII of the article to be scraped.

  • url (str, optional) – The URL of the article to be scraped.

  • chunk_size (int, optional) – The size of the chunks to split the full text into. Default is None.

Returns:

The full text of the article.

Return type:

str
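
The page does not say whether chunk_size counts characters, words, or tokens, nor how chunks are returned. Purely to illustrate what fixed-size chunking means, the sketch below assumes character counts and returns a list. split_into_chunks is a hypothetical helper and may not match the package's behavior.

```python
def split_into_chunks(text, chunk_size=None):
    """Split text into consecutive pieces of at most chunk_size characters.

    Assumption: chunk_size is measured in characters. With chunk_size=None
    the text is returned as a single chunk.
    """
    if chunk_size is None:
        return [text]
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    # Step through the text in strides of chunk_size; the last piece may be shorter.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```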

sciencescraper.sciencedirect.check_new_articles(api_key, query, days)

Check for new articles in Elsevier’s ScienceDirect database and notify the user of any new articles.

Parameters:
  • api_key (str) – The API key for the ScienceDirect API. API keys can be obtained by creating an account at https://dev.elsevier.com/.

  • query (str) – The search query to be used to search for new articles.

  • days (int) – The number of days back to search for new articles.

Returns:

A list of dictionaries containing the title, authors, journal, year, URL, open access status, keywords, abstract, methods, results, discussion, and references of the new articles.

Return type:

list of dict
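
One plausible reading of days is a look-back window that is converted into the 'YYYY-MM-DD' start date accepted by search_scidir. The page does not confirm this, so the conversion below is only a sketch. start_date_for and its today parameter are hypothetical names.

```python
from datetime import date, timedelta


def start_date_for(days, today=None):
    """Return the date `days` days before `today`, formatted as 'YYYY-MM-DD'.

    Assumption: "new" means published within the last `days` days.
    `today` is injectable so the result is reproducible in tests.
    """
    if days < 0:
        raise ValueError("days must be non-negative")
    today = today or date.today()
    return (today - timedelta(days=days)).isoformat()
```

For example, start_date_for(7, today=date(2024, 3, 10)) gives "2024-03-03".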

sciencescraper.sciencedirect.search_scidir(api_key, query, sortBy='relevance', startDate=None, max_results=25, offset=0)

Get articles from Elsevier’s ScienceDirect database that are relevant to a specified search query.

Parameters:
  • api_key (str) – The API key for the ScienceDirect API. API keys can be obtained by creating an account at https://dev.elsevier.com/.

  • query (str) – The search query to be used to search for articles.

  • sortBy (str, optional) – The sorting order for the search results. Options are “relevance” (sort by relevance) and “date” (sort by date). Default is “relevance”.

  • startDate (str, optional) – The start date for the search query in the format ‘YYYY-MM-DD’.

  • max_results (int, optional) – The maximum number of results to return. Default is 25. Permitted values: 10, 25, 50, 100.

  • offset (int, optional) – The number of results to skip. Default is 0.

Returns:

The DOIs of the articles that match the search query.

Return type:

list
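
Since max_results is capped at 100 and offset skips a number of results, larger result sets have to be collected page by page. The sketch below shows one way to do that. It assumes that a page shorter than max_results marks the end of the results, which the page does not state. search_all, its limit parameter, and PERMITTED_PAGE_SIZES are hypothetical names, and the search function is passed in so the loop can be tested offline.

```python
PERMITTED_PAGE_SIZES = (10, 25, 50, 100)  # values listed for max_results


def search_all(search_fn, api_key, query, page_size=25, limit=200, **kwargs):
    """Collect up to `limit` DOIs by calling search_fn with increasing offsets.

    search_fn is expected to accept the same arguments as search_scidir.
    Extra keyword arguments (e.g. sortBy, startDate) are passed through.
    """
    if page_size not in PERMITTED_PAGE_SIZES:
        raise ValueError(f"page_size must be one of {PERMITTED_PAGE_SIZES}")

    dois = []
    offset = 0
    while len(dois) < limit:
        page = search_fn(api_key, query, max_results=page_size, offset=offset, **kwargs)
        dois.extend(page)
        if len(page) < page_size:
            # Assumption: a short (or empty) page means no further results.
            break
        offset += page_size
    return dois[:limit]
```

With the real function this would be called as search_all(search_scidir, api_key, "query text", page_size=100, startDate="2024-01-01"). Each page is a separate API request, so limit also bounds the number of calls made against the key's quota.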