{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[],"authorship_tag":"ABX9TyNcBBjLNO0qsTtyM58i0pI/"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"}},"cells":[{"cell_type":"markdown","source":["---\n","
Mathematics in the context of AI
\n","
Information Theory
\n","\n","---\n","\n","Joaquin Carbonara 4/28/2024\n","\n","---"],"metadata":{"id":"p2HS6vUV-wvY"}},{"cell_type":"markdown","source":["The surprise (or entropy) of a possible outcome is KIND OF the opposite of the probability of a possible outcome. The more surprise, the less likely the outcome. As commonly done in math, the surprise is defined using tools that KIND OF do the job. So if the probability of an outcome $i$ is $p_i$, then the surprise is $\\log_2(1/p_i)$ if $p_i>0$ and $0$ if $p_i=0$.\n","\n","Expected surprise of a \"system\" is just the weighted average of the surprise:\n","\n","$H(X) = \\sum\n","\\begin{cases}\n","p_i \\log_2 \\frac{1}{p_i} & \\text{if } p_i > 0, \\\\\n","0 & \\text{if } p_i = 0.\n","\\end{cases}$\n","\n","or equivalently\n","\n","$H(X) = -\\sum\n","\\begin{cases}\n","p_i \\log_2 p_i & \\text{if } p_i > 0, \\\\\n","0 & \\text{if } p_i = 0.\n","\\end{cases}$\n","\n"],"metadata":{"id":"jAalhJzyDJp5"}},{"cell_type":"code","source":["# prompt: please write code to find the entropy of flipping a fair coin\n","\n","import math\n","def entropy_fair_coin():\n"," \"\"\"Calculates the entropy of flipping a fair coin.\n","\n"," Returns:\n"," The entropy in bits.\n"," \"\"\"\n","\n"," # The probability of each outcome is 1/2.\n"," p_heads = 1/2\n"," p_tails = 1/2\n","\n"," # Calculate the entropy using the formula:\n"," # H(X) = -sum(p_i * log2(p_i))\n","\n"," entropy = -(p_heads * math.log2(p_heads) + p_tails * math.log2(p_tails))\n","\n"," return entropy\n","\n","if __name__ == \"__main__\":\n"," entropy = entropy_fair_coin()\n"," print(\"The entropy of flipping a fair coin is:\", entropy)\n"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"GbXinDXCvB4p","executionInfo":{"status":"ok","timestamp":1714308297498,"user_tz":240,"elapsed":229,"user":{"displayName":"Joaquin Carbonara","userId":"14921749483686740647"}},"outputId":"8618bfb9-6281-449e-c83a-f26ef41b2965"},"execution_count":3,"outputs":[{"output_type":"stream","name":"stdout","text":["The entropy of flipping a fair coin is: 1.0\n"]}]},{"cell_type":"code","execution_count":2,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"hbEGohGg-taj","executionInfo":{"status":"ok","timestamp":1714308234005,"user_tz":240,"elapsed":230,"user":{"displayName":"Joaquin Carbonara","userId":"14921749483686740647"}},"outputId":"2709d429-6fa2-4a45-e953-22b0da36bfaf"},"outputs":[{"output_type":"execute_result","data":{"text/plain":["1.0"]},"metadata":{},"execution_count":2}],"source":["from math import log\n","p=[0.5,0.5]\n","OC={'H':p[0],'T':p[1]}\n","SS={'H':p[0]*log(p[0],2),'T':p[1]*log(p[1],2)}\n","EN=-sum(SS.values())\n","EN"]},{"cell_type":"markdown","source":[" EXAMPLE: Compute the entropy of several books and compare them:\n","\n","\n","Adventures of Huckleberry Finn by Mark Twain\n","https://www.gutenberg.org/ebooks/76.txt.utf-8\n","\n","Hamlet, Prince of Denmark by William Shakespeare\n","https://www.gutenberg.org/ebooks/1524.txt.utf-8\n","\n","Alice's Adventures in Wonderland by Lewis Carroll\n","https://www.gutenberg.org/ebooks/11.txt.utf-8\n","\n"],"metadata":{"id":"giBwQ4JYpTY1"}},{"cell_type":"markdown","source":["NOTE: Prompt an LLM to get started if you wish:\n","\n",">Download the book https://www.gutenberg.org/ebooks/76.txt.utf-8, find the frequency of the words, and then find the entropy of the frequency distribution of the words\n","\n","NOTE: gutenberg.org is very strict about web scraping. So be careful."],"metadata":{"id":"zaW4m57BPDaZ"}},{"cell_type":"code","source":["def ent(book):\n","\n"," import requests\n"," from collections import Counter\n","\n"," # Download the book\n"," response = requests.get(book)\n","\n"," # Extract the text from the response\n"," text = response.text\n","\n"," # Split the text into words\n"," words = text.split()\n","\n"," # Count the frequency of each word\n"," word_counts = Counter(words)\n","\n"," # Calculate the probability of each word\n"," total_words = len(words)\n"," word_probabilities = {word: count / total_words for word, count in word_counts.items()}\n","\n"," # Calculate the entropy\n"," entropy = 0\n"," for probability in word_probabilities.values():\n"," if probability > 0:\n"," entropy += probability * log(1 / probability, 2)\n"," max_ent=log(total_words,2)\n"," return [entropy,max_ent,entropy/max_ent]"],"metadata":{"id":"8po42g_pqQ6D","executionInfo":{"status":"ok","timestamp":1714266170415,"user_tz":240,"elapsed":165,"user":{"displayName":"Joaquin Carbonara","userId":"14921749483686740647"}}},"execution_count":132,"outputs":[]},{"cell_type":"code","source":["# gutemberg.org list of books in text form\n","books=\\\n"," [[\"Adventures of Huckleberry Finn\",\"Mark Twain\", \"https://www.gutenberg.org/ebooks/76.txt.utf-8\"],\n"," [\"Hamlet, Prince of Denmark\" , \"William Shakespeare\" ,\"https://www.gutenberg.org/ebooks/1524.txt.utf-8\"],\n"," [\"Alice's Adventures in Wonderland\" , \"Lewis Carroll\" ,\"https://www.gutenberg.org/ebooks/11.txt.utf-8\"],\n"," [\"STRUWWELPETER\",\"Heinrich Hoffman\",\"https://www.gutenberg.org/cache/epub/12116/pg12116.txt\"]]\n"],"metadata":{"id":"xp2uP9w1TUSM","executionInfo":{"status":"ok","timestamp":1714251279510,"user_tz":240,"elapsed":163,"user":{"displayName":"Joaquin Carbonara","userId":"14921749483686740647"}}},"execution_count":130,"outputs":[]},{"cell_type":"code","source":["import pandas as pd\n","entropy=[]\n","for book in books:\n"," entropy+=[[book[0]]+[book[1]]+ent(book[2])]\n","df_entropy=pd.DataFrame(entropy,columns=[\"title\",\"author\",\"entropy\",\"max_entropy\",\"rel_entropy\"])\n","df_entropy"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":175},"id":"9iUCX83_qp8V","executionInfo":{"status":"ok","timestamp":1714251284686,"user_tz":240,"elapsed":2929,"user":{"displayName":"Joaquin Carbonara","userId":"14921749483686740647"}},"outputId":"9477c0af-832c-4108-dc26-4069bfe801d2"},"execution_count":131,"outputs":[{"output_type":"execute_result","data":{"text/plain":[" title author entropy \\\n","0 Adventures of Huckleberry Finn Mark Twain 9.915130 \n","1 Hamlet, Prince of Denmark William Shakespeare 10.414850 \n","2 Alice's Adventures in Wonderland Lewis Carroll 9.822476 \n","3 STRUWWELPETER Heinrich Hoffman 9.352131 \n","\n"," max_entropy rel_entropy \n","0 16.800268 0.590177 \n","1 15.094944 0.689956 \n","2 14.851554 0.661377 \n","3 12.405939 0.753843 "],"text/html":["\n","
\n","
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
titleauthorentropymax_entropyrel_entropy
0Adventures of Huckleberry FinnMark Twain9.91513016.8002680.590177
1Hamlet, Prince of DenmarkWilliam Shakespeare10.41485015.0949440.689956
2Alice's Adventures in WonderlandLewis Carroll9.82247614.8515540.661377
3STRUWWELPETERHeinrich Hoffman9.35213112.4059390.753843
\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","\n","
\n"," \n"," \n"," \n","
\n","\n","
\n","
\n"],"application/vnd.google.colaboratory.intrinsic+json":{"type":"dataframe","variable_name":"df_entropy","summary":"{\n \"name\": \"df_entropy\",\n \"rows\": 4,\n \"fields\": [\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"Hamlet, Prince of Denmark\",\n \"STRUWWELPETER\",\n \"Adventures of Huckleberry Finn\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"author\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"William Shakespeare\",\n \"Heinrich Hoffman\",\n \"Mark Twain\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"entropy\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.4355812537707372,\n \"min\": 9.352131293101166,\n \"max\": 10.414849529306283,\n \"num_unique_values\": 4,\n \"samples\": [\n 10.414849529306283,\n 9.352131293101166,\n 9.915129830471765\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"max_entropy\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1.8093911921379875,\n \"min\": 12.405939193342398,\n \"max\": 16.800267975949442,\n \"num_unique_values\": 4,\n \"samples\": [\n 15.09494363673212,\n 12.405939193342398,\n 16.800267975949442\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"rel_entropy\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.06786031033216408,\n \"min\": 0.5901768855512215,\n \"max\": 0.753843070431939,\n \"num_unique_values\": 4,\n \"samples\": [\n 0.6899561720762395,\n 0.753843070431939,\n 0.5901768855512215\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"}},"metadata":{},"execution_count":131}]}]}