sciencescraper.sciencedirect
This subpackage contains modules for scraping articles from ScienceDirect.
Submodules
scidir_extract: Functions to extract information from the raw XML text of a ScienceDirect article.
scidir_clean: Functions to clean the text extracted from ScienceDirect articles.
scidir_scrape: Functions for retrieving the clean text of ScienceDirect articles.
The main function of this subpackage is get_article_info in scidir_scrape. This function retrieves the
full text of a ScienceDirect article using the ScienceDirect API and returns a dictionary containing the title, authors,
journal, year, URL, open access status, keywords, abstract, methods, results, discussion, and references of the article.
search_scidir in scidir_search can be used to search for articles on ScienceDirect given a query string and a start date to
search from.
Package Contents
Functions
get_article_info | Get the full text of a ScienceDirect article using the ScienceDirect API.
get_full_text | Get the full text of a ScienceDirect article using the ScienceDirect API.
check_new_articles | Check for new articles in Elsevier's ScienceDirect database and notify the user of any new articles.
search_scidir | Get articles from Elsevier's ScienceDirect database that are relevant to a specified search query.
- sciencescraper.sciencedirect.get_article_info(api_key, doi=None, pii=None, url=None, chunk_size=None)
Get the full text of a ScienceDirect article using the ScienceDirect API.
- Parameters:
api_key (str) – The API key for the ScienceDirect API. API keys can be obtained by creating an account at https://dev.elsevier.com/.
doi (str, optional) – The DOI of the article to be scraped.
pii (str, optional) – The PII of the article to be scraped.
url (str, optional) – The URL of the article to be scraped.
chunk_size (int, optional) – The size of the chunks to split the full text into. Default is None.
- Returns:
A dictionary containing the title, authors, journal, year, URL, open access status, keywords, abstract, methods, results, discussion, and references of the article.
- Return type:
dict
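A minimal usage sketch for get_article_info. It assumes the returned dictionary uses lowercase keys such as "title", "year", and "abstract"; the exact key names are not listed in this reference, so treat them as assumptions. summarize_article and its fetch parameter are hypothetical: fetch lets the sketch run against a stand-in function without network access, and falls back to the real get_article_info when omitted.

```python
def summarize_article(api_key, doi, fetch=None):
    """Return a one-line summary of an article looked up by DOI."""
    if fetch is None:
        # Imported lazily so the sketch also runs with a stand-in fetch
        # function when the package is not installed.
        from sciencescraper.sciencedirect import get_article_info as fetch

    info = fetch(api_key, doi=doi)

    # Key names are assumptions; .get() avoids a KeyError if they differ.
    title = info.get("title", "<no title>")
    year = info.get("year", "n.d.")
    abstract = info.get("abstract") or ""
    return f"{title} ({year}): {len(abstract.split())} words in abstract"
```

With a valid key, a call would look like summarize_article("YOUR_API_KEY", "YOUR_DOI"), where both strings are placeholders.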
- sciencescraper.sciencedirect.get_full_text(api_key, doi=None, pii=None, url=None, chunk_size=None)
Get the full text of a ScienceDirect article using the ScienceDirect API.
- Parameters:
api_key (str) – The API key for the ScienceDirect API. API keys can be obtained by creating an account at https://dev.elsevier.com/.
doi (str, optional) – The DOI of the article to be scraped.
pii (str, optional) – The PII of the article to be scraped.
url (str, optional) – The URL of the article to be scraped.
chunk_size (int, optional) – The size of the chunks to split the full text into. Default is None.
- Returns:
The full text of the article.
- Return type:
str
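Because doi, pii, and url are all optional keywords, a caller presumably passes whichever identifier is at hand. The sketch below shows one way to route a raw identifier string to the matching keyword. identifier_kwargs is a hypothetical helper, not part of the package, and its fallback rule (anything that is neither a URL nor a DOI is treated as a PII) is an assumption.

```python
def identifier_kwargs(identifier):
    """Map an article identifier string to a doi=, pii=, or url= keyword."""
    value = identifier.strip()
    if value.lower().startswith(("http://", "https://")):
        return {"url": value}
    if value.startswith("10."):
        # DOI names begin with the prefix "10."
        return {"doi": value}
    # Anything else is treated as a PII (an assumption of this sketch).
    return {"pii": value}
```

The result can then be unpacked into either retrieval function, for example get_full_text(api_key, **identifier_kwargs(some_id)).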
- sciencescraper.sciencedirect.check_new_articles(api_key, query, days)
Check for new articles in Elsevier’s ScienceDirect database and notify the user of any new articles.
- Parameters:
api_key (str) – The API key for the ScienceDirect API. API keys can be obtained by creating an account at https://dev.elsevier.com/.
query (str) – The search query to be used to search for new articles.
days (int) – The number of days to look back when searching for new articles.
- Returns:
A list of dictionaries containing the title, authors, journal, year, URL, open access status, keywords, abstract, methods, results, discussion, and references of the new articles.
- Return type:
list of dict
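The days parameter describes a time window. How check_new_articles converts it internally is not documented here; the sketch below only illustrates how such a window could be turned into a date string in the 'YYYY-MM-DD' format that search_scidir accepts for startDate. start_date_for_window is a hypothetical helper, not part of the package.

```python
from datetime import date, timedelta


def start_date_for_window(days, today=None):
    """Return the 'YYYY-MM-DD' date that lies `days` days before `today`."""
    if days < 0:
        raise ValueError("days must be non-negative")
    today = today or date.today()
    # date.isoformat() always yields the YYYY-MM-DD form.
    return (today - timedelta(days=days)).isoformat()
```

The optional today argument exists only so the calculation can be checked against a fixed date.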
- sciencescraper.sciencedirect.search_scidir(api_key, query, sortBy='relevance', startDate=None, max_results=25, offset=0)
Get articles from Elsevier’s ScienceDirect database that are relevant to a specified search query.
- Parameters:
api_key (str) – The API key for the ScienceDirect API. API keys can be obtained by creating an account at https://dev.elsevier.com/.
query (str) – The search query to be used to search for articles.
sortBy (str, optional) – The sorting order for the search results. Options are “relevance” (sort by relevance) and “date” (sort by date). Default is “relevance”.
startDate (str, optional) – The start date for the search query in the format ‘YYYY-MM-DD’.
max_results (int, optional) – The maximum number of results to return. Default is 25. Permitted values: 10, 25, 50, 100.
offset (int, optional) – The number of results to skip. Default is 0.
- Returns:
A list of DOIs of the articles.
- Return type:
list
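Since max_results is capped at 100 and offset skips results, larger result sets have to be gathered page by page. The sketch below shows one way to do that. collect_dois is a hypothetical helper, not part of the package. It takes the search function as an argument so it can be exercised with a stand-in, and it assumes that a page shorter than the requested size means there are no further results.

```python
PERMITTED_PAGE_SIZES = (10, 25, 50, 100)


def collect_dois(search, api_key, query, limit, page_size=25, **kwargs):
    """Gather up to `limit` DOIs by calling `search` page by page."""
    if page_size not in PERMITTED_PAGE_SIZES:
        raise ValueError(f"page_size must be one of {PERMITTED_PAGE_SIZES}")
    dois = []
    offset = 0
    while len(dois) < limit:
        page = search(api_key, query, max_results=page_size,
                      offset=offset, **kwargs)
        dois.extend(page)
        if len(page) < page_size:
            break  # a short page is taken to mean no more results
        offset += page_size
    return dois[:limit]
```

With the real function this would be called as collect_dois(search_scidir, api_key, "your query", limit=60, startDate="2024-01-01"), where extra keywords such as sortBy and startDate are passed through unchanged.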