sciencescraper.pmc.pmc_scrape

Functions for retrieving the raw text of PubMed Central articles.

Module Contents

Functions

fetch_pmc_article(pmc_id)

Fetches an article from PMC given a PMC ID

parse_pmc_article(pmc_article, chunk_size)

Parses a PMC article (a BeautifulSoup object) into a dictionary

get_article_info(pmc_id[, chunk_size])

Fetches and parses an article from PMC given a PMC ID

get_full_text(pmc_id[, chunk_size])

Fetches the full text of an article from PMC given a PMC ID

sciencescraper.pmc.pmc_scrape.fetch_pmc_article(pmc_id)

Fetches an article from PMC given a PMC ID

Parameters:

pmc_id (str) – The PMC ID of the article

Returns:

soup – The article as a BeautifulSoup object

Return type:

BeautifulSoup
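
This page does not say whether pmc_id should carry the "PMC" prefix. The sketch below is one way to normalize user input to the usual "PMC<digits>" form before passing it on; normalize_pmc_id is a hypothetical helper and is not part of sciencescraper, and whether fetch_pmc_article expects the prefixed form is an assumption to check against the source.

```python
import re

# Accepts "PMC1234567", "pmc1234567", or bare digits such as "1234567".
_PMC_ID_PATTERN = re.compile(r"^(?:PMC)?([0-9]+)$", re.IGNORECASE)


def normalize_pmc_id(raw_id: str) -> str:
    """Return the ID as 'PMC<digits>' (hypothetical helper, not library API)."""
    match = _PMC_ID_PATTERN.match(raw_id.strip())
    if match is None:
        raise ValueError(f"not a PMC ID: {raw_id!r}")
    return f"PMC{match.group(1)}"
```

The normalized string could then be passed as the pmc_id argument, for example fetch_pmc_article(normalize_pmc_id(user_input)).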

sciencescraper.pmc.pmc_scrape.parse_pmc_article(pmc_article, chunk_size)

Parses a PMC article (a BeautifulSoup object) into a dictionary

Parameters:
  • pmc_article (BeautifulSoup) – The article as a BeautifulSoup object

  • chunk_size (int) – The size of the chunks to split the full text into

Returns:

article – The parsed article

Return type:

dict
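
The unit of chunk_size is not stated here. To illustrate what "splitting the full text into chunks" can mean, the sketch below splits a string into consecutive pieces of at most chunk_size characters. split_into_chunks is a hypothetical helper; the library may well count words or tokens instead, so treat the character-based behavior as an assumption.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters.

    Assumption: chunks are measured in characters. The library's actual
    unit is not documented on this page.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

With this definition, every chunk except possibly the last has exactly chunk_size characters, and joining the chunks reproduces the original text.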

sciencescraper.pmc.pmc_scrape.get_article_info(pmc_id, chunk_size=None)

Fetches and parses an article from PMC given a PMC ID

Parameters:
  • pmc_id (str) – The PMC ID of the article

  • chunk_size (int, optional) – The size of the chunks to split the full text into. Default is None.

Returns:

article – The parsed article

Return type:

dict
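
Going by its description, get_article_info plausibly combines the two functions documented above. The sketch below shows that composition with the fetch and parse steps passed in as arguments, so it can be tried without network access. article_info_sketch is a hypothetical name, and the real implementation may differ.

```python
from typing import Any, Callable, Optional


def article_info_sketch(
    pmc_id: str,
    chunk_size: Optional[int] = None,
    *,
    fetch: Callable[[str], Any],
    parse: Callable[[Any, Optional[int]], dict],
) -> dict:
    """Fetch an article, then parse it (assumed composition, not library code)."""
    document = fetch(pmc_id)            # e.g. fetch_pmc_article
    return parse(document, chunk_size)  # e.g. parse_pmc_article
```

In real use, fetch and parse would be fetch_pmc_article and parse_pmc_article from this module; in tests they can be replaced with simple stand-ins.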

sciencescraper.pmc.pmc_scrape.get_full_text(pmc_id, chunk_size=None)

Fetches the full text of an article from PMC given a PMC ID

Parameters:
  • pmc_id (str) – The PMC ID of the article

  • chunk_size (int, optional) – The size of the chunks to split the full text into. Default is None.

Returns:

full_text – The full text of the article

Return type:

str
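
A short usage sketch, assuming the package is installed and that get_full_text returns a plain string when chunk_size is left at its default, as the return type above indicates. preview_full_text is a hypothetical wrapper; it imports the documented function lazily so that a stand-in can be supplied when the package or the network is unavailable.

```python
from typing import Callable, Optional


def preview_full_text(
    pmc_id: str,
    max_chars: int = 200,
    get_text: Optional[Callable[[str], str]] = None,
) -> str:
    """Return the first max_chars characters of an article's full text."""
    if get_text is None:
        # Import path taken from this page; requires sciencescraper installed.
        from sciencescraper.pmc.pmc_scrape import get_full_text as get_text
    text = get_text(pmc_id)
    return text[:max_chars]
```

Calling preview_full_text with only a PMC ID would use the library function and needs network access; passing get_text explicitly avoids both.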