sciencescraper.pmc

This subpackage contains modules for scraping articles from PubMed Central.

Submodules

  • pmc_extract: Functions to extract information from the raw XML text of a PubMed Central article.

  • pmc_scrape: Functions for retrieving the clean text of PubMed Central articles.

  • pmc_search: Functions for searching for articles on PubMed Central.

The main functions of this subpackage are get_article_info (in pmc_scrape) and search_pmc (in pmc_search). get_article_info retrieves a PubMed Central article through the PubMed Central API and returns a dictionary containing the article’s information together with its full text. search_pmc searches PubMed Central for articles matching a query and returns a list of PMC IDs for the search results.
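The search-then-fetch workflow described above can be sketched as follows. This is a minimal illustration, not part of the package: collect_articles is a hypothetical helper, and the search and fetch callables are passed in as arguments so the sketch runs without network access. The only library behavior it relies on is what this page documents: search_pmc accepts query and retmax and returns a list of PMC IDs, and get_article_info accepts a PMC ID and returns a dict.

```python
from typing import Callable, Dict, List


def collect_articles(
    query: str,
    search: Callable[..., List[str]],
    fetch: Callable[[str], dict],
    limit: int = 5,
) -> Dict[str, dict]:
    """Search for a query, then fetch each hit; returns {pmc_id: article}.

    Hypothetical helper: `search` is expected to behave like search_pmc
    and `fetch` like get_article_info.
    """
    articles: Dict[str, dict] = {}
    # retmax caps the number of IDs returned by the search step.
    for pmc_id in search(query, retmax=limit):
        try:
            articles[pmc_id] = fetch(pmc_id)
        except Exception as exc:
            # One failed article should not stop the whole batch.
            print(f"Skipping {pmc_id}: {exc}")
    return articles
```

With the package installed, the call would presumably look like collect_articles("some query", search_pmc, get_article_info), with both functions imported from sciencescraper.pmc. The keys of each returned dictionary are not listed on this page, so the sketch stores the dictionaries without inspecting them.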

Package Contents

Functions

get_article_info(pmc_id[, chunk_size])

Fetches and parses an article from PMC given a PMC ID

get_full_text(pmc_id[, chunk_size])

Fetches the full text of an article from PMC given a PMC ID

search_pmc(query[, sort, mindate, maxdate, reldate, ...])

Searches PMC for articles given a query

check_new_articles(query, days[, chunk_size])

Gets open access articles from PubMed Central published within a specified number of days before the current date.

sciencescraper.pmc.get_article_info(pmc_id, chunk_size=None)

Fetches and parses an article from PMC given a PMC ID

Parameters:
  • pmc_id (str) – The PMC ID of the article

  • chunk_size (int, optional) – The size of the chunks to split the full text into. Default is None.

Returns:

article – The parsed article

Return type:

dict

sciencescraper.pmc.get_full_text(pmc_id, chunk_size=None)

Fetches the full text of an article from PMC given a PMC ID

Parameters:
  • pmc_id (str) – The PMC ID of the article

  • chunk_size (int, optional) – The size of the chunks to split the full text into. Default is None.

Returns:

full_text – The full text of the article

Return type:

str
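This page does not say whether chunk_size counts characters, words, or tokens, nor how the chunks are represented in the returned value. Purely to illustrate the idea of fixed-size chunking, the sketch below assumes character counts. split_into_chunks is a hypothetical helper and should not be read as the library's implementation.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters.

    Assumption: size is measured in characters; the library may differ.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    # Step through the text in strides of chunk_size; the last piece may be shorter.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


chunks = split_into_chunks("abcdefghij", 4)
print(chunks)  # ['abcd', 'efgh', 'ij']
```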

sciencescraper.pmc.search_pmc(query, sort='relevance', mindate=None, maxdate=None, reldate=None, retstart=0, retmax=20)

Searches PMC for articles given a query

Parameters:
  • query (str) – The query to search for

  • sort (str, optional) – The sorting order for the search results. Options are:
      - “relevance”: Sort by relevance
      - “pub_date”: Sort by publication date in descending order
      - “JournalName”: Sort by journal in ascending order
      - “Author”: Sort by first author in ascending order

  • mindate (str, optional) – The minimum date for the search results. Format is “YYYY/MM/DD”, “YYYY/MM”, or “YYYY”. Must be provided together with maxdate.

  • maxdate (str, optional) – The maximum date for the search results. Format is “YYYY/MM/DD”, “YYYY/MM”, or “YYYY”. Must be provided together with mindate.

  • reldate (str, optional) – The number of days to search back from the current date.

  • retstart (int, optional) – The index of the first article to return. Default is 0.

  • retmax (int, optional) – The maximum number of articles to return. Default is 20.

Returns:

pmc_ids – The PMC IDs of the search results

Return type:

list
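Because retmax limits each call and retstart sets the offset, collecting more results than one call returns means paging. The sketch below shows one way to do that, along with a check for the two date rules stated in the parameter list (mindate and maxdate go together, and both use a YYYY, YYYY/MM, or YYYY/MM/DD format). search_all is a hypothetical helper; the search callable is injected and is assumed to accept the same keyword arguments as search_pmc.

```python
import re
from typing import Callable, List, Optional

# Matches YYYY, YYYY/MM, or YYYY/MM/DD (format only, not calendar validity).
_DATE_RE = re.compile(r"^\d{4}(/\d{2}){0,2}$")


def search_all(
    search: Callable[..., List[str]],
    query: str,
    page_size: int = 20,
    max_results: int = 100,
    mindate: Optional[str] = None,
    maxdate: Optional[str] = None,
) -> List[str]:
    """Page through search results using retstart/retmax (hypothetical helper)."""
    if (mindate is None) != (maxdate is None):
        raise ValueError("mindate and maxdate must be provided together")
    for value in (mindate, maxdate):
        if value is not None and not _DATE_RE.match(value):
            raise ValueError(f"Unsupported date format: {value!r}")

    pmc_ids: List[str] = []
    while len(pmc_ids) < max_results:
        page = search(
            query,
            mindate=mindate,
            maxdate=maxdate,
            retstart=len(pmc_ids),  # continue where the previous page ended
            retmax=min(page_size, max_results - len(pmc_ids)),
        )
        if not page:  # an empty page means there are no more results
            break
        pmc_ids.extend(page)
    return pmc_ids
```

With the package installed, search_all(search_pmc, "some query", max_results=60) would presumably issue three calls of 20 results each. Any rate limiting imposed by the upstream API is not addressed here.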

sciencescraper.pmc.check_new_articles(query, days, chunk_size=None)

Gets open access articles from PubMed Central published within a specified number of days before the current date.

Parameters:
  • query (str) – The query to search for

  • days (int) – The number of days to search back from the current date.

  • chunk_size (int, optional) – The size of the chunks to split the full text into. Default is None.

Returns:

pmc_articles – A list of dictionaries containing article information

Return type:

list of dict
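A window of days counted back from the current date can also be written as an explicit date pair in the “YYYY/MM/DD” format that search_pmc accepts for mindate and maxdate. The sketch below shows that conversion. date_window is a hypothetical helper, and whether check_new_articles computes its window this way (or treats the endpoints as inclusive) is not stated on this page.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def date_window(days: int, today: Optional[date] = None) -> Tuple[str, str]:
    """Return (mindate, maxdate) strings covering the last `days` days.

    Hypothetical helper; `today` can be fixed for reproducible results.
    """
    if days < 0:
        raise ValueError("days must be non-negative")
    today = today or date.today()
    start = today - timedelta(days=days)
    return start.strftime("%Y/%m/%d"), today.strftime("%Y/%m/%d")


print(date_window(7, today=date(2024, 3, 5)))  # ('2024/02/27', '2024/03/05')
```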