Module ntfp.ntfp

It's like Nimbus but uses a transformer language model

Implemented in a functional programming style

Resources

  • import typing
  • import functools
  • Functional Design Patterns - Scott Wlaschin
  • "Types are not classes... they're sort of like Sets"
  • Why Isn't Functional Programming the Norm? – Richard Feldman
  • "NewType declares one type to be a subtype of another"
      • subtype means the same thing as subclass in this context
  • __pdoc__override
  • pyre-check error suppression
  • mypy type hints cheat sheet
  • Carl Meyer - Type-checked Python in the real world - PyCon 2018

#!/usr/bin/env python3
"""it's like [Nimbus][1] but uses [a transformer language model][2]

Implemented in a [functional programming style][4]

[//]: # (markdown comment # noqa)

Resources:
    * [import typing][python3_typing]
    * [import functools][python3_functools]
    * [Functional Design Patterns - Scott Wlaschin][wlaschin_talk]
    * ["Types are not classes... they're sort of like Sets"][wlaschin_talk_types]
    * [Why Isn't Functional Programming the Norm? – Richard Feldman][richard_feldman_talk]
    * ["NewType declares one type to be a _subtype_ of another"][new_type]
        * _subtype_ means the same thing as _subclass_ in this context
    * [__pdoc__override]
    * [pyre-check error suppression][5]
    * [mypy type hints cheat sheet][6]
    * [Carl Meyer - Type-checked Python in the real world - PyCon 2018][carl_myer_pycon2018]

[1]: http://github.com/calpoly-csai/api
[2]: https://github.com/huggingface/transformers
[3]: http://github.com/mfekadu/nimbus-transformer

[4]: https://realpython.com/courses/functional-programming-python/

[5]: https://pyre-check.org/docs/error-suppression.html

[6]: https://mypy.readthedocs.io/en/stable/cheat_sheet_py3.html

[carl_myer_pycon2018]: https://youtu.be/pMgmKJyWKn8

[python3_typing]: https://docs.python.org/3/library/typing.html
[python3_functools]: https://docs.python.org/3/library/functools.html

[wlaschin_talk]: https://youtu.be/ucnWLfBA1dc
[wlaschin_talk_types]: https://youtu.be/ucnWLfBA1dc?t=685

[richard_feldman_talk]: https://youtu.be/QyJZzq0v7Z4

[new_type]: https://docs.python.org/3/library/typing.html#newtype

[__pdoc__override]: https://pdoc3.github.io/pdoc/doc/pdoc/#overriding-docstrings-with-__pdoc__
"""
from typing import Optional, get_type_hints
from bs4 import BeautifulSoup
import googlesearch
from fuzzywuzzy import fuzz
from requests import get
from requests.models import Response
from transformers import pipeline
from typing_extensions import Final
from typing import List, Callable, Iterator, Tuple
from ntfp.ntfp_types import (
    IDK,
    IDK_TYPE,
    Answer,
    Context,
    WebPageContext,
    GooglePage,
    GoogleResultURL,
    GoogleResultURLIterator,
    Query,
    Question,
    SanitizedQuery,
    WebPage,
    URL,
    ExtraDataDict,
)
import spacy


def create_query(question: Question) -> Query:
    """
    [//]: # (markdown comment # noqa)

    Creates a new
        [`Query`](ntfp_types.html#ntfp.ntfp_types.Query) string
        from a given [`Question`](ntfp_types.html#ntfp.ntfp_types.Question).

    Args:
        question: A [`Question`](ntfp_types.html#ntfp.ntfp_types.Question) string.

    Returns:
        A [`Query`](ntfp_types.html#ntfp.ntfp_types.Query) string.
    """
    # make a Google query with appropriate scope of domain name
    # by kind-of-sort-of-cast-str-to-Query-via-__init__-but-is-that-really-casting-idk-how-types-work-in-python  # noqa
    query: Query = Query(f"{question} site:calpoly.edu")
    return query


def url_param_sanitize(query: Query) -> SanitizedQuery:
    """Sanitizes the given
        [`Query`](ntfp_types.html#ntfp.ntfp_types.Query) string
        for use in a [`URL`](ntfp_types.html#ntfp.ntfp_types.URL)
        as an HTTP parameter in an HTTP GET request.

    [//]: # (markdown comment # noqa)

    Args:
        query: A [`Query`](ntfp_types.html#ntfp.ntfp_types.Query) string
            that would be typed into the Google Search box,
            which is expected to be used as a URL parameter.

    Example:
        >>> query: Query = Query("what is foaad khosmood's email? site:calpoly.edu")
        >>> url_param_sanitize(query)
        'what+is+foaad+khosmood%27s+email%3F+site%3Acalpoly.edu'
        >>> url_param_sanitize("a!a@a#a$a%a^a&a*a(a)a_a+a a")
        'a%21a%40a%23a%24a%25a%5Ea%26a%2Aa%28a%29a_a%2Ba+a'

    Returns:
        A [`SanitizedQuery`](ntfp_types.html#ntfp.ntfp_types.SanitizedQuery) \
            string such that spaces are converted to `+` \
                and special characters into their appropriate codes.

    """
    return SanitizedQuery(googlesearch.quote_plus(query))  # pyre-ignore[16]


def get_page(url: URL, verbose=False) -> WebPage:
    """Returns the html \
        [`WebPage`](ntfp_types.html#ntfp.ntfp_types.WebPage) \
        of the given [`URL`](ntfp_types.html#ntfp.ntfp_types.URL).
    """
    if url.endswith("pdf"):
        if verbose:
            print("skipping PDF file && returning empty WebPage...")
        # TODO: consider returning None? but then return is Optional[WebPage]
        # TODO: consider having create_query avoid PDFs via "-filetype:pdf"
        # TODO: but also consider that Google can get GoogleContext from PDFs,
        #     :    which is good
        # TODO: but for sure get_page should avoid PDFs.
        #     :    unless we can import some fancy PDF OCR package to handle it
        return WebPage("")
    response: Response = get(url)
    html: str = response.text
    return WebPage(html)


def get_google_page(query: Query) -> GooglePage:
    """
    Perform a Google Search and return the html content.

    Args:
        query: A [`Query`](ntfp_types.html#ntfp.ntfp_types.Query) string
            that would be typed into the Google Search box,
            which is expected to be used as a URL parameter.

    Example:
        >>> question: Question = Question("what is foaad email?")
        >>> query: Query = create_query(question)
        >>> query
        ... 'what is foaad email? site:calpoly.edu'
        >>> google_page: GooglePage = get_google_page(query)
        >>> google_page
        ... '<html><body><div>...Google...foaad...email...</div></body></html>'
        >>> type(google_page)  # type still str at runtime
        ... <class 'str'>

    Returns:
        A string of HTML representing the [Google Search result][4] \
            [`GooglePage`](ntfp_types.html#ntfp.ntfp_types.GooglePage).

    [4]: http://google.com/search?q=what+is+foaad+email?+site:calpoly.edu
    """
    BASE_GOOGLE_URL: Final[URL] = URL("https://www.google.com/search?q=")

    sanitized_query: SanitizedQuery = url_param_sanitize(query)

    url: URL = URL(f"{BASE_GOOGLE_URL}{sanitized_query}")

    html_page: GooglePage = GooglePage(get_page(url))

    return html_page


def fetch_google_result_urls(
    query: Query, limit: Optional[int] = None
) -> GoogleResultURLIterator:
    """Fetches [`GoogleResultURL`](ntfp_types.html#ntfp.ntfp_types.GoogleResultURL)s \
        from a [large list of Google Search results][4].

    [//]: # (markdown comment # noqa)

    Retrieves strings of \
        [`GoogleResultURL`](ntfp_types.html#ntfp.ntfp_types.GoogleResultURL)s \
        pertaining to the given [`Query`](ntfp_types.html#ntfp.ntfp_types.Query).

    Args:
        query: A [`Query`](ntfp_types.html#ntfp.ntfp_types.Query) string
            that would be typed into the Google Search box,
            which is expected to be used as a URL parameter.

        limit: An optional integer for the total number of results to fetch.
            By default `None` means fetch all results that Google offers.

    Yields:
        A single \
            [`GoogleResultURL`](ntfp_types.html#ntfp.ntfp_types.GoogleResultURL)
            from the Google Search
            [`GooglePage`](ntfp_types.html#ntfp.ntfp_types.GooglePage).

    Resources:
        * How to type annotate Generators
            * https://stackoverflow.com/q/27264250
            * https://docs.python.org/3/library/typing.html#typing.Generator

    [4]: http://google.com/search?q=what+is+foaad+email?+site:calpoly.edu
    """
    for url in googlesearch.search(
        query,
        num=10,
        stop=limit,  # allows for infinite-ish generation of google results
        country="",  # TODO: consider setting San Luis Obispo if possible?
    ):
        yield GoogleResultURL(url)


def transformer(q: Question, c: Context) -> Tuple[Answer, ExtraDataDict]:
    """transformer

    [//]: # (markdown comment # noqa)

    Resources:
        * HuggingFace Transformers pipelines
            * https://github.com/huggingface/transformers#quick-tour-of-pipelines
    """
    if len(c) <= 0:
        extra_data: ExtraDataDict = {
            "score": -1.0,
            "start": -1,
            "end": -1,
            "tokenizer": "NA_SKIPPED_TRANSFORMER",
            "model": "NA_SKIPPED_TRANSFORMER",
        }
        return (
            Answer(IDK),
            extra_data,
        )
    # FIXME: this line below needs an internet connection!
    nlp = pipeline("question-answering")
    input_data = {"question": q, "context": c}
    answer = nlp(input_data)
    extra_data: ExtraDataDict = {
        "score": answer.get("score", -1.0),
        "start": answer.get("start", -1),
        "end": answer.get("end", -1),
        "tokenizer": nlp.tokenizer.__class__.__name__,
        "model": nlp.model.__class__.__name__,
    }
    return (Answer(answer.get("answer", IDK)), extra_data)


def extract_webpage_context(
    page: WebPage, only_paragraphs: Optional[bool] = False
) -> WebPageContext:
    """Extracts the text from a given HTML \
        [`WebPage`](ntfp_types.html#ntfp.ntfp_types.WebPage) and returns it as a \
        [`WebPageContext`](ntfp_types.html#ntfp.ntfp_types.WebPageContext).

    [//]: # (markdown comment # noqa)

    Args:
        page: A [`WebPage`](ntfp_types.html#ntfp.ntfp_types.WebPage) HTML string.
        only_paragraphs: An Optional boolean value to specify whether to only \
            look at paragraph tags `<p>`. (Default = False).

    Returns:
        The [`WebPageContext`](ntfp_types.html#ntfp.ntfp_types.WebPageContext) string.

    Example:
        >>> html = "<html><div>Hello World!</div><code>126/3==42</code></html>"
        >>> p: WebPage = WebPage(html)
        >>> wpc: WebPageContext = extract_webpage_context(p)
        >>> wpc
        ... 'Hello World!126/3==42'

    Resources:
        * BeautifulSoup4
            * https://pypi.org/project/beautifulsoup4/
    """
    soup: BeautifulSoup = BeautifulSoup(markup=page, features="html.parser")
    if only_paragraphs is True:
        paragraph_text: str = "".join([p.text for p in soup.find_all("p")])
        return WebPageContext(Context(paragraph_text))
    text: Context = Context(str(soup.text))
    return WebPageContext(text)


class NtfpNoEntityError(Exception):
    """Spacy's named entity recognizer failed to find an entity in a sentence.

    Attributes:
        sentence -- the sentence that lacks an entity.
        message -- explanation of the error
    """

    def __init__(self, sentence, message):
        super().__init__(message)
        self.sentence = sentence
        self.message = message


def relevance(to, nlp=None, FUZZ_THRESHOLD=30, LEN_THRESHOLD=2):
    FUZZ_THRESHOLD = FUZZ_THRESHOLD or 30
    LEN_THRESHOLD = LEN_THRESHOLD or 2
    # TODO: make smarter filter THRESHOLDS
    # TODO: consider semantic similarity
    original_question = to
    entity_text = None
    if isinstance(nlp, spacy.language.Language):
        doc = nlp(original_question)
        ents = doc.ents
        if len(ents) > 0:
            entity = ents[0]
            entity_text = entity.text
            # metadata = {
            #     "entity_text": entity.text,
            #     "entity_start_char": entity.start_char,
            #     "entity_end_char": entity.end_char,
            #     "entity_label_": entity.label_,
            # }
            if entity_text is None or entity_text == "":
                msg = f"'{original_question}' has no named entity from spacy?"
                raise NtfpNoEntityError(original_question, msg)
        else:
            msg = f"'{original_question}' has no named entity from spacy?"
            raise NtfpNoEntityError(original_question, msg)

    def _filter_func(text):
        text_question_lexical_similarity = fuzz.ratio(text, original_question)
        # TODO: make smarter filter rules
        if entity_text is not None and entity_text in text:
            # ASSUME: answer would contain exact match of entity_text
            text_question_lexical_similarity += FUZZ_THRESHOLD
        if original_question in text:
            # ASSUME: that answer would not include original_question
            return False
        if text_question_lexical_similarity < FUZZ_THRESHOLD:
            # ASSUME: some lexical similarity question with answer
            return False
        if len(text) < LEN_THRESHOLD:
            # ASSUME: text is long enough to contain an answer.
            return False
        return True

    return _filter_func


# fmt:off
def filter_list_by_relevance(
    to: str,
    lst: List[str],
    FUZZ: Optional[int] = None,
    LEN: Optional[int] = None,
    nlp: Optional[object] = None,
) -> Iterator[str]:
    relevant_text_list: Iterator[str] = filter(
        relevance(to=to, FUZZ_THRESHOLD=FUZZ, LEN_THRESHOLD=LEN, nlp=nlp), lst,
    )
    return relevant_text_list
# fmt:on


def filter_string_by_relevance(
    to: str,
    string: str,
    FUZZ: Optional[int] = None,
    LEN: Optional[int] = None,
    limit: Optional[int] = None,
    sep: str = "\n",
    nlp: Optional[object] = None,
) -> str:
    lst = string.split(sep)
    relevant_text_list: Iterator[str] = filter_list_by_relevance(
        to=to, lst=lst, FUZZ=FUZZ, LEN=LEN, nlp=nlp
    )
    return sep.join(sorted(relevant_text_list, reverse=True)[:limit])


def extract_relevant_context(page: WebPage, question: Question) -> Context:
    soup: BeautifulSoup = BeautifulSoup(markup=page, features="html.parser")

    txt_lst: List[str] = [x for x in soup.stripped_strings]

    # Filter by relevance to the question
    relevant_text_list = filter_list_by_relevance(to=question, lst=txt_lst)

    return Context("\n".join(relevant_text_list))


def get_context(
    question: Question, use_google: bool = True, verbose: bool = False
) -> Tuple[Query, WebPage, Context]:
    if use_google:
        query: Query = create_query(question)
        page: GooglePage = get_google_page(query)
        if verbose:
            print("query: ", query, "\n")
            print("len(page): ", len(page), "\n")
        context: Context = extract_relevant_context(page, question)
        return query, page, context
    else:
        raise NotImplementedError


if __name__ == "__main__":
    print()
    print("IDK: ", IDK, "\n")
    # reveal_type(IDK)

    print("type(IDK): ", type(IDK), "\n")
    # reveal_type(type(IDK))

    print("IDK_TYPE: ", IDK_TYPE, "\n")
    # reveal_type(IDK_TYPE)

    print("type(IDK_TYPE): ", type(IDK_TYPE), "\n")
    # reveal_type(type(IDK_TYPE))

    # print("IDK_TYPE_TypeVar: ", IDK_TYPE_TypeVar, "\n")
    # # reveal_type(IDK_TYPE_TypeVar)

    # print("type(IDK_TYPE_TypeVar): ", type(IDK_TYPE_TypeVar), "\n")
    # # reveal_type(type(IDK_TYPE_TypeVar))

    # print('IDK_TYPE("hello"): ', IDK_TYPE("hello"), "\n")

    # print('type(IDK_TYPE("hello")): ', type(IDK_TYPE("hello")), "\n")

    print("transformer: ", transformer, "\n")
    # reveal_type(transformer)

    print("type(transformer): ", type(transformer), "\n")
    # reveal_type(type(transformer))

    # print("Transformer: ", Transformer, "\n")
    # # reveal_type(Transformer)

    # print("type(Transformer): ", type(Transformer), "\n")
    # # reveal_type(type(Transformer))

    print("get_type_hints(transformer): ", get_type_hints(transformer), "\n")
    # reveal_type(get_type_hints(transformer))

    print()

    user_input: str = input("question: ")
    # reveal_type(user_input)

    question: Question = Question(user_input)
    # reveal_type(question)

    query: Query = create_query(question)
    print("query: ", query, "\n")
    print("type(query) == str: ", type(query) == str, "\n")
    # this typing module is not intuitive sometimes.
    # I suppose the reason is that the type checker is static
    # and has no effect on runtime values.
    print("type(query) != Query: ", type(query) != Query, "\n")
    # reveal_type(query)
    # reveal_type(type(query))
    # reveal_type(Query)

    sanitized_query: SanitizedQuery = url_param_sanitize(query)
    print("sanitized_query: ", sanitized_query, "\n")
    # reveal_type(sanitized_query)

    # x = get_google_page(sanitized_query)
    # y = get_google_page(query)
    # assert x[:5] == y[:5]
    # # reveal_type(x)
    # # reveal_type(y)

    google_page: GooglePage = get_google_page(query)

    first_ten_urls: List[GoogleResultURL] = [
        x for x in fetch_google_result_urls(query, limit=10)
    ]  # noqa
    print("first_ten_urls: ", first_ten_urls)
    # reveal_type(first_ten_urls)

    f: Callable[[URL], WebPage] = get_page
    result_pages: List[WebPage] = [f(url) for url in first_ten_urls]
    # reveal_type(result_pages)

    # a new name, so the type checker does not flag a redefinition of f
    g: Callable[[WebPage], WebPageContext] = extract_webpage_context
    contexts: List[WebPageContext] = [g(page) for page in result_pages]

    large_context: Context = Context("\n\n".join(contexts))

    print(large_context)
    # print(": ", transformer("ok", "cool"))

Functions

def create_query(question)

Creates a new Query string from a given Question.

Args

question
A Question string.

Returns

A Query string.

def create_query(question: Question) -> Query:
    """
    [//]: # (markdown comment # noqa)

    Creates a new
        [`Query`](ntfp_types.html#ntfp.ntfp_types.Query) string
        from a given [`Question`](ntfp_types.html#ntfp.ntfp_types.Question).

    Args:
        question: A [`Question`](ntfp_types.html#ntfp.ntfp_types.Question) string.

    Returns:
        A [`Query`](ntfp_types.html#ntfp.ntfp_types.Query) string.
    """
    # make a Google query with appropriate scope of domain name
    # by kind-of-sort-of-cast-str-to-Query-via-__init__-but-is-that-really-casting-idk-how-types-work-in-python  # noqa
    query: Query = Query(f"{question} site:calpoly.edu")
    return query
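
A minimal usage sketch (at runtime Query and Question are plain str, since NewType has no runtime effect):

from ntfp.ntfp import create_query
from ntfp.ntfp_types import Query, Question

question = Question("what is foaad khosmood's email?")
query: Query = create_query(question)
assert query == "what is foaad khosmood's email? site:calpoly.edu"
# NewType is erased at runtime, so the result is an ordinary str:
assert type(query) is str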
def extract_relevant_context(page, question)
def extract_relevant_context(page: WebPage, question: Question) -> Context:
    soup: BeautifulSoup = BeautifulSoup(markup=page, features="html.parser")

    txt_lst: List[str] = [x for x in soup.stripped_strings]

    # Filter by relevance to the question
    relevant_text_list = filter_list_by_relevance(to=question, lst=txt_lst)

    return Context("\n".join(relevant_text_list))
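
This composes BeautifulSoup's stripped_strings with filter_list_by_relevance. A small sketch with hypothetical HTML (which lines survive depends on the fuzzywuzzy thresholds inside relevance):

from ntfp.ntfp import extract_relevant_context
from ntfp.ntfp_types import Context, Question, WebPage

html = (
    "<html>"
    "<p>foaad khosmood's email is foaad@calpoly.edu</p>"
    "<nav>home | about | contact</nav>"
    "</html>"
)
page = WebPage(html)
question = Question("what is foaad khosmood's email?")
context: Context = extract_relevant_context(page, question)
print(context)  # the strings judged relevant, joined by newlines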
def extract_webpage_context(page, only_paragraphs=False)

Extracts the text from a given HTML WebPage and returns it as a WebPageContext.

Args

page
A WebPage HTML string.
only_paragraphs
An Optional boolean value to specify whether to only look at paragraph tags <p>. (Default = False).

Returns

The WebPageContext string.

Example

>>> html = "<html><div>Hello World!</div><code>126/3==42</code></html>"
>>> p: WebPage = WebPage(html)
>>> wpc: WebPageContext = extract_webpage_context(p)
>>> wpc
... 'Hello World!126/3==42'

Resources

  • BeautifulSoup4
      • https://pypi.org/project/beautifulsoup4/

def extract_webpage_context(
    page: WebPage, only_paragraphs: Optional[bool] = False
) -> WebPageContext:
    """Extracts the text from a given HTML \
        [`WebPage`](ntfp_types.html#ntfp.ntfp_types.WebPage) and returns it as a \
        [`WebPageContext`](ntfp_types.html#ntfp.ntfp_types.WebPageContext).

    [//]: # (markdown comment # noqa)

    Args:
        page: A [`WebPage`](ntfp_types.html#ntfp.ntfp_types.WebPage) HTML string.
        only_paragraphs: An Optional boolean value to specify whether to only \
            look at paragraph tags `<p>`. (Default = False).

    Returns:
        The [`WebPageContext`](ntfp_types.html#ntfp.ntfp_types.WebPageContext) string.

    Example:
        >>> html = "<html><div>Hello World!</div><code>126/3==42</code></html>"
        >>> p: WebPage = WebPage(html)
        >>> wpc: WebPageContext = extract_webpage_context(p)
        >>> wpc
        ... 'Hello World!126/3==42'

    Resources:
        * BeautifulSoup4
            * https://pypi.org/project/beautifulsoup4/
    """
    soup: BeautifulSoup = BeautifulSoup(markup=page, features="html.parser")
    if only_paragraphs is True:
        paragraph_text: str = "".join([p.text for p in soup.find_all("p")])
        return WebPageContext(Context(paragraph_text))
    text: Context = Context(str(soup.text))
    return WebPageContext(text)
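
A short sketch of the only_paragraphs flag (the HTML here is hypothetical):

from ntfp.ntfp import extract_webpage_context
from ntfp.ntfp_types import WebPage

page = WebPage("<html><p>Hello</p><div>menu bar</div><p>World</p></html>")
# default: every text node is concatenated
assert extract_webpage_context(page) == "Hellomenu barWorld"
# only_paragraphs=True keeps only the <p> contents
assert extract_webpage_context(page, only_paragraphs=True) == "HelloWorld"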
def fetch_google_result_urls(query, limit=None)

Fetches GoogleResultURLs from a large list of Google Search results.

Retrieves strings of GoogleResultURLs pertaining to the given Query.

Args

query
A Query string that would be typed into the Google Search box, which is expected to be used as a URL parameter.
limit
An optional integer for the total number of results to fetch. By default None means fetch all results that Google offers.

Yields

A single GoogleResultURL from the Google Search GooglePage.

Resources

  • How to type annotate Generators
      • https://stackoverflow.com/q/27264250
      • https://docs.python.org/3/library/typing.html#typing.Generator

def fetch_google_result_urls(
    query: Query, limit: Optional[int] = None
) -> GoogleResultURLIterator:
    """Fetches [`GoogleResultURL`](ntfp_types.html#ntfp.ntfp_types.GoogleResultURL)s \
        from a [large list of Google Search results][4].

    [//]: # (markdown comment # noqa)

    Retrieves strings of \
        [`GoogleResultURL`](ntfp_types.html#ntfp.ntfp_types.GoogleResultURL)s \
        pertaining to the given [`Query`](ntfp_types.html#ntfp.ntfp_types.Query).

    Args:
        query: A [`Query`](ntfp_types.html#ntfp.ntfp_types.Query) string
            that would be typed into the Google Search box,
            which is expected to be used as a URL parameter.

        limit: An optional integer for the total number of results to fetch.
            By default `None` means fetch all results that Google offers.

    Yields:
        A single \
            [`GoogleResultURL`](ntfp_types.html#ntfp.ntfp_types.GoogleResultURL)
            from the Google Search
            [`GooglePage`](ntfp_types.html#ntfp.ntfp_types.GooglePage).

    Resources:
        * How to type annotate Generators
            * https://stackoverflow.com/q/27264250
            * https://docs.python.org/3/library/typing.html#typing.Generator

    [4]: http://google.com/search?q=what+is+foaad+email?+site:calpoly.edu
    """
    for url in googlesearch.search(
        query,
        num=10,
        stop=limit,  # allows for infinite-ish generation of google results
        country="",  # TODO: consider setting San Luis Obispo if possible?
    ):
        yield GoogleResultURL(url)
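
Because this returns a generator, result URLs can be consumed lazily; a sketch (requires network access, and the actual URLs will vary):

from ntfp.ntfp import create_query, fetch_google_result_urls
from ntfp.ntfp_types import Question

query = create_query(Question("what is foaad khosmood's email?"))
# stop after the first three result URLs instead of exhausting Google
for url in fetch_google_result_urls(query, limit=3):
    print(url)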
def filter_list_by_relevance(to, lst, FUZZ=None, LEN=None, nlp=None)
def filter_list_by_relevance(
    to: str,
    lst: List[str],
    FUZZ: Optional[int] = None,
    LEN: Optional[int] = None,
    nlp: Optional[object] = None,
) -> Iterator[str]:
    relevant_text_list: Iterator[str] = filter(
        relevance(to=to, FUZZ_THRESHOLD=FUZZ, LEN_THRESHOLD=LEN, nlp=nlp), lst,
    )
    return relevant_text_list
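
A small sketch; with nlp=None the spacy entity check is skipped, and relevance falls back to its default thresholds (FUZZ_THRESHOLD=30, LEN_THRESHOLD=2):

from ntfp.ntfp import filter_list_by_relevance

question = "what is foaad khosmood's email?"
lines = [
    "foaad khosmood's email is foaad@calpoly.edu",  # lexically similar: likely kept
    "x",                                            # too short and dissimilar: dropped
    "Skip to main content",                         # low similarity: likely dropped
]
print(list(filter_list_by_relevance(to=question, lst=lines)))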
def filter_string_by_relevance(to, string, FUZZ=None, LEN=None, limit=None, sep='\n', nlp=None)
def filter_string_by_relevance(
    to: str,
    string: str,
    FUZZ: Optional[int] = None,
    LEN: Optional[int] = None,
    limit: Optional[int] = None,
    sep: str = "\n",
    nlp: Optional[object] = None,
) -> str:
    lst = string.split(sep)
    relevant_text_list: Iterator[str] = filter_list_by_relevance(
        to=to, lst=lst, FUZZ=FUZZ, LEN=LEN, nlp=nlp
    )
    return sep.join(sorted(relevant_text_list, reverse=True)[:limit])
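
The string variant splits on sep, filters each piece, and rejoins; note the survivors are sorted in reverse lexicographic order (not by relevance score) before the limit is applied. A sketch:

from ntfp.ntfp import filter_string_by_relevance

question = "what is foaad khosmood's email?"
text = "\n".join([
    "foaad khosmood's email is foaad@calpoly.edu",
    "about | home | login",
    "foaad khosmood is a professor at cal poly",
])
# keep at most two surviving lines, rejoined with "\n"
print(filter_string_by_relevance(to=question, string=text, limit=2))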
def get_context(question, use_google=True, verbose=False)
def get_context(
    question: Question, use_google: bool = True, verbose: bool = False
) -> Tuple[Query, WebPage, Context]:
    if use_google:
        query: Query = create_query(question)
        page: GooglePage = get_google_page(query)
        if verbose:
            print("query: ", query, "\n")
            print("len(page): ", len(page), "\n")
        context: Context = extract_relevant_context(page, question)
        return query, page, context
    else:
        raise NotImplementedError
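
Putting the pieces together, a typical end-to-end call (needs network access; Google may throttle scripted requests):

from ntfp.ntfp import get_context, transformer
from ntfp.ntfp_types import Question

question = Question("what is foaad khosmood's email?")
query, page, context = get_context(question, verbose=True)
answer, extra_data = transformer(question, context)
print("answer:", answer)
print("score:", extra_data["score"])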
def get_google_page(query)

Perform a Google Search and return the html content.

Args

query
A Query string that would be typed into the Google Search box, which is expected to be used as a URL parameter.

Example

>>> question: Question = Question("what is foaad email?")
>>> query: Query = create_query(question)
>>> query
... 'what is foaad email? site:calpoly.edu'
>>> google_page: GooglePage = get_google_page(query)
>>> google_page
... '<html><body><div>...Google...foaad...email...</div></body></html>'
>>> type(google_page)  # type still str at runtime
... <class 'str'>

Returns

A string of HTML representing the Google Search result GooglePage.

def get_google_page(query: Query) -> GooglePage:
    """
    Perform a Google Search and return the html content.

    Args:
        query: A [`Query`](ntfp_types.html#ntfp.ntfp_types.Query) string
            that would be typed into the Google Search box,
            which is expected to be used as a URL parameter.

    Example:
        >>> question: Question = Question("what is foaad email?")
        >>> query: Query = create_query(question)
        >>> query
        ... 'what is foaad email? site:calpoly.edu'
        >>> google_page: GooglePage = get_google_page(query)
        >>> google_page
        ... '<html><body><div>...Google...foaad...email...</div></body></html>'
        >>> type(google_page)  # type still str at runtime
        ... <class 'str'>

    Returns:
        A string of HTML representing the [Google Search result][4] \
            [`GooglePage`](ntfp_types.html#ntfp.ntfp_types.GooglePage).

    [4]: http://google.com/search?q=what+is+foaad+email?+site:calpoly.edu
    """
    BASE_GOOGLE_URL: Final[URL] = URL("https://www.google.com/search?q=")

    sanitized_query: SanitizedQuery = url_param_sanitize(query)

    url: URL = URL(f"{BASE_GOOGLE_URL}{sanitized_query}")

    html_page: GooglePage = GooglePage(get_page(url))

    return html_page
def get_page(url, verbose=False)

Returns the html WebPage of the given URL.

def get_page(url: URL, verbose=False) -> WebPage:
    """Returns the html \
        [`WebPage`](ntfp_types.html#ntfp.ntfp_types.WebPage) \
        of the given [`URL`](ntfp_types.html#ntfp.ntfp_types.URL).
    """
    if url.endswith("pdf"):
        if verbose:
            print("skipping PDF file && returning empty WebPage...")
        # TODO: consider returning None? but then return is Optional[WebPage]
        # TODO: consider having create_query avoid PDFs via "-filetype:pdf"
        # TODO: but also consider that Google can get GoogleContext from PDFs,
        #     :    which is good
        # TODO: but for sure get_page should avoid PDFs.
        #     :    unless we can import some fancy PDF OCR package to handle it
        return WebPage("")
    response: Response = get(url)
    html: str = response.text
    return WebPage(html)
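
A short sketch of the PDF guard (the first call makes no network request, the second does; the PDF URL here is hypothetical):

from ntfp.ntfp import get_page
from ntfp.ntfp_types import URL

# URLs ending in "pdf" are short-circuited to an empty WebPage:
assert get_page(URL("https://example.com/syllabus.pdf")) == ""

# anything else is fetched over HTTP with requests.get:
html = get_page(URL("https://www.calpoly.edu/"))
print(len(html))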
def relevance(to, nlp=None, FUZZ_THRESHOLD=30, LEN_THRESHOLD=2)
def relevance(to, nlp=None, FUZZ_THRESHOLD=30, LEN_THRESHOLD=2):
    FUZZ_THRESHOLD = FUZZ_THRESHOLD or 30
    LEN_THRESHOLD = LEN_THRESHOLD or 2
    # TODO: make smarter filter THRESHOLDS
    # TODO: consider semantic similarity
    original_question = to
    entity_text = None
    if isinstance(nlp, spacy.language.Language):
        doc = nlp(original_question)
        ents = doc.ents
        if len(ents) > 0:
            entity = ents[0]
            entity_text = entity.text
            # metadata = {
            #     "entity_text": entity.text,
            #     "entity_start_char": entity.start_char,
            #     "entity_end_char": entity.end_char,
            #     "entity_label_": entity.label_,
            # }
            if entity_text is None or entity_text == "":
                msg = f"'{original_question}' has no named entity from spacy?"
                raise NtfpNoEntityError(original_question, msg)
        else:
            msg = f"'{original_question}' has no named entity from spacy?"
            raise NtfpNoEntityError(original_question, msg)

    def _filter_func(text):
        text_question_lexical_similarity = fuzz.ratio(text, original_question)
        # TODO: make smarter filter rules
        if entity_text is not None and entity_text in text:
            # ASSUME: answer would contain exact match of entity_text
            text_question_lexical_similarity += FUZZ_THRESHOLD
        if original_question in text:
            # ASSUME: that answer would not include original_question
            return False
        if text_question_lexical_similarity < FUZZ_THRESHOLD:
            # ASSUME: some lexical similarity question with answer
            return False
        if len(text) < LEN_THRESHOLD:
            # ASSUME: text is long enough to contain an answer.
            return False
        return True

    return _filter_func
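
relevance returns a predicate (a closure over the question, the optional spacy entity, and the thresholds) rather than filtering directly, which keeps it composable with the built-in filter. A sketch; loading a spacy model is optional, and a question in which spacy finds no named entity raises NtfpNoEntityError:

from ntfp.ntfp import relevance

is_relevant = relevance(to="what is foaad khosmood's email?")
print(is_relevant("foaad khosmood's email is foaad@calpoly.edu"))  # likely True
print(is_relevant("x"))  # False: fails the similarity and length thresholds

# with a named entity recognizer (assuming en_core_web_sm is installed):
# import spacy
# nlp = spacy.load("en_core_web_sm")
# is_relevant = relevance(to="what is foaad khosmood's email?", nlp=nlp)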
def transformer(q, c)

Runs a HuggingFace question-answering pipeline on the given Question and Context.

Resources

  • HuggingFace Transformers pipelines
      • https://github.com/huggingface/transformers#quick-tour-of-pipelines

def transformer(q: Question, c: Context) -> Tuple[Answer, ExtraDataDict]:
    """transformer

    [//]: # (markdown comment # noqa)

    Resources:
        * HuggingFace Transformers pipelines
            * https://github.com/huggingface/transformers#quick-tour-of-pipelines
    """
    if len(c) <= 0:
        extra_data: ExtraDataDict = {
            "score": -1.0,
            "start": -1,
            "end": -1,
            "tokenizer": "NA_SKIPPED_TRANSFORMER",
            "model": "NA_SKIPPED_TRANSFORMER",
        }
        return (
            Answer(IDK),
            extra_data,
        )
    # FIXME: this line below needs an internet connection!
    nlp = pipeline("question-answering")
    input_data = {"question": q, "context": c}
    answer = nlp(input_data)
    extra_data: ExtraDataDict = {
        "score": answer.get("score", -1.0),
        "start": answer.get("start", -1),
        "end": answer.get("end", -1),
        "tokenizer": nlp.tokenizer.__class__.__name__,
        "model": nlp.model.__class__.__name__,
    }
    return (Answer(answer.get("answer", IDK)), extra_data)
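
A minimal usage sketch (the first call downloads a default question-answering model from the HuggingFace hub, so it needs an internet connection):

from ntfp.ntfp import transformer
from ntfp.ntfp_types import Context, Question

question = Question("what is foaad khosmood's email?")
context = Context("Foaad Khosmood can be reached at foaad@calpoly.edu.")
answer, extra_data = transformer(question, context)
print(answer)               # e.g. "foaad@calpoly.edu"
print(extra_data["score"])  # the model's confidence

# an empty Context short-circuits to IDK with sentinel metadata:
answer, extra_data = transformer(question, Context(""))
assert extra_data["model"] == "NA_SKIPPED_TRANSFORMER"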
def url_param_sanitize(query)

Sanitizes the given Query string for use in a URL as an HTTP parameter in an HTTP GET request.

Args

query
A Query string that would be typed into the Google Search box, which is expected to be used as a URL parameter.

Example

>>> query: Query = Query("what is foaad khosmood's email? site:calpoly.edu")
>>> url_param_sanitize(query)
'what+is+foaad+khosmood%27s+email%3F+site%3Acalpoly.edu'
>>> url_param_sanitize("a!a@a#a$a%a^a&a*a(a)a_a+a a")
'a%21a%40a%23a%24a%25a%5Ea%26a%2Aa%28a%29a_a%2Ba+a'

Returns

A SanitizedQuery string such that spaces are converted to + and special characters into their appropriate codes.

def url_param_sanitize(query: Query) -> SanitizedQuery:
    """Sanitizes the given
        [`Query`](ntfp_types.html#ntfp.ntfp_types.Query) string
        for use in a [`URL`](ntfp_types.html#ntfp.ntfp_types.URL)
        as an HTTP parameter in an HTTP GET request.

    [//]: # (markdown comment # noqa)

    Args:
        query: A [`Query`](ntfp_types.html#ntfp.ntfp_types.Query) string
            that would be typed into the Google Search box,
            which is expected to be used as a URL parameter.

    Example:
        >>> query: Query = Query("what is foaad khosmood's email? site:calpoly.edu")
        >>> url_param_sanitize(query)
        'what+is+foaad+khosmood%27s+email%3F+site%3Acalpoly.edu'
        >>> url_param_sanitize("a!a@a#a$a%a^a&a*a(a)a_a+a a")
        'a%21a%40a%23a%24a%25a%5Ea%26a%2Aa%28a%29a_a%2Ba+a'

    Returns:
        A [`SanitizedQuery`](ntfp_types.html#ntfp.ntfp_types.SanitizedQuery) \
            string such that spaces are converted to `+` \
                and special characters into their appropriate codes.

    """
    return SanitizedQuery(googlesearch.quote_plus(query))  # pyre-ignore[16]

Classes

class NtfpNoEntityError (sentence, message)

spaCy's named entity recognizer failed to find an entity in a sentence.

Attributes

sentence
The sentence that lacks an entity.
message
Explanation of the error.

class NtfpNoEntityError(Exception):
    """Spacy's named entity recognizer failed to find an entity in a sentence.

    Attributes:
        sentence -- the sentence that lacks an entity.
        message -- explanation of the error
    """

    def __init__(self, sentence, message):
        super().__init__(message)
        self.sentence = sentence
        self.message = message

Ancestors

  • builtins.Exception
  • builtins.BaseException